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Abstract 

The Morph system provides a framework for automatic 
collection and management of profile information and 
application of profile-driven optimizations. In this paper, we 
focus on the operating system support that is required to 
collect and manage profile information on an end-user's 
workstation in an automatic . continuous, and transparent 
manner. Our implementation for a Digital Alpha machine 
running Digital UNK 4.0 achieves run-time overheads of less 
than 0.3% during profile collection. Through the application 
of three code layout optimizations, we further show that 
Morph can use statistical profiles to improve application 
performance. With appropriate system support, automatic 
profiling and optimization is both possible and effective, 

1. Introduction 

Morph is a combination of operating system and compiler 
technology that provides a practical framework for the 
advanced compiler optimizations needed to support continued 
improvements in application performance. Morph is practical 
because it provides system support for the automatic 
application of profile-driven compiler optimizations on an 
end -user's system. The major obstacle to the use of these 
optimizations is obtaining timely high-quality profile 
information. Ideally, such profiles should be representative of 
how a program is used on a specific machine by a specific 
user. Given the current infrastructure for compilation, it is not 
practical to obuin and use such profiles, as typical end-users 
do not know how to generate profile information and would 
not be able to optimize the application if they did. The Morph 
system addresses this problem, using operating system 
support to collect, process, and apply profile information to 
rc-optimize applications automatically. 

This paper describes our implementation of Morph for the 
Digital Alpha processor and Digital UNDC. Our 
implementation is composed of several components. The 
Morph Monitor is an operating system kernel component that 
implements continuous, low overhead profiling and program 
monitoring. The Morph Editor is a compiler component that 

Permitiion to make digital/hard copy of part or all this work for 

p«rBonal or cJassroom u«* m granted without feo provided that 
copia* sro not mad* or diatributad for profit or commarcisi advarw 
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implements rc-optimization, transforming compiler 
intermediate form into an executable. The Morph Manager is 
a system component that manages profile information, 
including the automatic invocation of rc-optimization. In this 
paper we focus on the systems components, the Monitor and 
the Manager. Although some discussion of the Morph Editor 
is needed to provide context for the rest of the work, detailed 
description of specific compiler optimizations is outside the 
scope of this paper. 

We designed our system to meet a number of requirements 
that we believe are necessary for mainstream systems and 
users: 

• Optimization should happen on the machine where the 
software is used For complex interactive applications, 
there can be substantial variation in how the application is 
used by different individuals. Also, binaries for popular 
applications run on a broad range of binary-compatible 
systems with widely varying internal architecture and 
performance characteristics. Given these sources of 
variation between systems, profile-driven optimization 
will be most effective if it can incorporate information 
specific to the end-user and the user's machine. 

• Optimization should not require source code. Although 
recompilation from source code may be a reasonable 
alternative for public-domain and low- volume software, it 
is not practical for the commodity software market. To 
enable optimization without source code, we assume that 
an intermediate -form representation of the program can be 
derived from the program distribution media, with all 
optimizations based on this intermediate form. In practice, 
there are a number of alternatives for achieving this, 
including executable editing techniques [LS95, RVL97] 
and compact program representations (FK96]. 

• Optimization must be transparent to the user. We assume 
that the people who use computers are non -specialists with 
no background in computer science or compilation. 
Optimization should be transparent in that the user does 
not need to understand or actively participate in the 
optimization process. Transparency also implies chat the 
system be robust and provide consistent performance 
benefits with no negative performance impact. 

The Morph system is designed to address alt these 
requirements. It automatically collects profiles and uses them 
to optimize programs on the machine where the software is 
used. The optimizations transform an intermediate form 
representation of the program to a new executable without 
requiring source code. As an intermediate form representation 
is not shipped with current applications, it must be generated 
from source in our system. Morph makes it possible to re- 
optimize an application using a small number of simple. 
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mechanical steps. Our results show that Moipb is an effective 
platform for profile-based optimization. We also show that 
the overhead of Morph profile collection and management is 
negligible. This paper describes a design and implementation 
of Morph, and gives performance results thai illustrate the 
measured benefit of profile-driven optimization on our testbed 
system. 

1.1 Related Work 

There are a number of research projects in low-overhead 
profiling and late optimization. In this section we review 
some relevant projects and explain how they differ from 
Motph. 

Morph enables executable programs to evolve with changes in 
program usage patterns and computer hardware. This 
functionality is not provided by any current tool, although 
Morph builds on technology and techniques originally 
developed for program instrumentation and measurement. 
Several existing tools (Pixie fSmi91]. ATOM tSE94], EEL 
fLS95]) rewrite executables for the purpose of program 
measurement. These instrumentation tools differ from Morph 
in two important ways. First, though the overhead of 
instru mention-based profiling has improved* yielding 
slowdowns as low as 10%. this level of overhead is still too 
large for continuous nwnitoring. frequently exceeding the 
benefit of optimization. In Morph. we avoid this overhead 
through statistical profile sampling in the operating system. In 
Section 4.3, we show thai the on-line overhead for sample 
collection in Morph is significantly less than 1%. A further 
difference between Morph and earlier profiling tools is that 
Morph is designed specifically for automatic profiling and 
optimization. 

Morph applies profile -driven, machine-specific optinruzations 
to intermediate representations of programs and libraries. Like 
Morph. three prior systems from Digital Equipment 
Corporation (Mahler PW87). Epoxie [Wal92], OM [SW92]) 
also performed optimizations on programs after compilation. 
These tools were designed for UNIX systems and required 
cither an intermediate form representation of the program or 
an executable that includes relocation information. More 
recent systems for Windows NT include Spike [CL96, 
Goo971 and Etch rRVL97]. Both implement profile-based 
optimizations. Etch for the Intel x86 architecture and Spike 
for Digital Alpha-based systems. 

The Digital Continuous Profiling Infrastructure (DCPI). 
developed at the Digital Systems Research Center, supports 
continuous monitoring of the entire system including the 
kernel fABD97]. There are substantial similarities between 
DCPI profile collection and the continuous monitor used in 
Morph. Both DCPI and the Morph Monitor use statistical 
sampling to collect profile information of all system activity 
with very low overhead. The primary difference between the 
two monitors is how they have been applied. In Morph we 
have focused on optimization, whereas the emphasis to dale 
for DCPI has. been on . performance analysis: It would be 

relatively straightforward to use DCPI as a source of profiles 
for Morph. 

Come ct al. [CMH96) describe an approach in which new 
hardware support in the form of a profile buffer is used to 
collect statistics describing the behavior of conditional 




branches in the program. They iise the resulting information 
to drive a superblock- scheduling optimfzaiion. Although the 
Morph design does not preclude the use of new hardware 
support for profile collection, it docs not require it either. 

DigitaKs FX!32 is a system that provides transparent 
execution of x86 Win32 applications on Windows NT Alpha 
systems [CH97, Rub96]. As a part of their emulation system, 
execution profiles of x86 code are collected and fed to a 
background process that translates the previously emulated 
portions of x86 binaries into native Alpha code. Though the 
goal of Digital FX!32 is different from Morph, the two 
systems share many similarities: both systems continuously 
collea profile information of executables in a transparent 
manner and use the profiles to guide a code re -writing 
operation. The main differences between the two systems are 
that FX!32 works directly on executables, it collects profile 
samples during program emulation (rather than direct 
execution) of the application, and that it emphasizes code 
translation, rather than code optimization as in Morph. 

In the next section, we discuss the design goals for Morph. 
Section 3 then describes our implementation. In Section 4, we 
report performance results, both in terms of optimization and 
overhead. Section 5 discusses future research directions and 
other issues in making the system practical for widespread 
use. Conclusions follow in Section 6. 

2. Motivation for System Design 

Morph provides a framework for profile -based and host- 
specific optimization. An important feature of Morph is that 
optimization occurs on the system where the applications are 
run. The final stage of optimization occurs after the end-user 
has already installed and used the application. This makes it 
possible for optimization to incorporate the idiosyncratic 
usage patterns of a specific user and track changes in the way 
a program is used. It also means that all details of the host 
hardware can be used during optimization. Several 
observations about profile data and machine description 
information serve to clarify the potential advantages of the 
late application of host-specific optimizations. 

Profile information describes how an application is used by an 
end-user. The trend in modem applications is to implement a 
large number of features and to hide the complexity of feature 
selection behind a rich graphical user interface. I^r such 
applications, the subset of features used by different 
individuals can vary substantially. This will tend to increase 
the need for profile-driven optimization and the value of 
collecting profile information as close as possible to the end- 
user. Also, the way an individual uses a complex interaaive 
application can evolve over time. For example, a user may 
discover a new feature in a user- interface and make regular 
use of it from that time on. This suggests a situation in which 
profile-based pptimizations must be continuously re-applied 
in order to achieve their full benefit This variation may not be 
significant for batch-style UNIX workloads such as those 
used in this paper. Nonetheless, the techniques we present 
facilitate exploration of these issues for more complex 
interactive workloads. 

The second type of information used by host-specific 
optimizations describes a specific implementation of a 
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mactiine architecture. Many modem compilcf'Tjptimizations 
depend on information that is specific to an implementation of 
an instruction set architecture and is not expressed in the 
instruction-set description, such as: 

• Pipeline organization. This information directs instruction 
scheduling. 

• Instruction timing. Often, the relative cost of an 
instruction sequence changes from one machine 
implementation to the next. This information influences 
the instruction selection and scheduling processes. 

• Cache parameters. This information can be used by 
locality optimizations such as data blocking [LRWQI] and 
code layout rPH90). 

These hardware parameters commonly vary between binary- 
compatible systems, even between those designed at the same 
time in similar VLSI technologies. In the absence of a suitable 
infrasinicture for host-specific optimization, hardware 
designers have created machines that attempt to execute 
programs efficiently without help from the compiler. An 
example is the current trend towards out-of-order instruction 
execution, in which the hardware does instruction scheduling 
on the fly rather than relying on software scheduling. While 
this has the advantage of providing performance 
improvements for legacy software, it has the disadvantages of 
increased hardware complexity and sub-optimal overall 
performance. In general, the lack of a practical infrastructure 
for host-specific optimization leads to increased hardware 
complexity, even when simpler hardware alternatives that 
require minimal compiler suppon exist. In spite of the lack of 
a workable compiler infrastructure, the need for host- specific 
optimizations is increasing in modem machines. Experts in 
the computer architecture and compiler communities 
anticipate that the importance of these optimizations will 
increase for the next generation of machines [Chr96. Gwe96I. 

3. Implementation 

Our prototype implementation of Morph supports automatic 
profile generation and re-optimization for Digital Alpha-based 
workstations running Digital UNIX. We employed several 
custom system tools and made small modifications to Digital 
UNIX to support automatic profile generation, collection, and 
analysis. The compiler components of Morph are based on the 
SUIF research compiler infrastructure available from Stanford 
University [SUIF94]. We have extended this infrastructure to 
support profile-driven, machine-specific optimizations 
[Smi96]. We are planning support for Windows NT and the 
x86 ISA. as a step towards optimization of more complex 
interactive applications. Practical considerations led us to 
choose rhe UNIX/ Alpha platform for our initial 
implementation, including access to source code for Digital 
UNIX. In Section 3.1. wc introduce rlxc components of the 
Morph system and describe how they are incorporated into a 
complete system for automatic program optimization. The 
subsections that follow focus on the main systems 
components of Morph, namely the Morph Monitor for profile 
sample collection (Section 3.2), the Morph Mainager that 
provides off-lin£ profile processing (Section 3.3). and the' 
Morph Editor that implemenu optimizations (Section 3.4), 



3.1. Overview 

Our initial design targets a single-user workstation 
environment. Specifically, we assume that each user has his 
or her own machine, and that all instructions executed on the 
machine are run on behalf of that user. We assume that each 
user has a private copy of the applications they use. Finally, 
we assume that there are substantial periods of idle time. 
These assumptions are appropriate for most worksutions and 
personal computers. For machines that do not satisfy these 
assumptions, additional engineering is required. For example, 
in an environment where users share a single copy of an 
executable file from a file server, the optimized version of the 
module could either be created on a per-user basis or be 
shared by all users. The former requires more disk space and 
the latter requires distributed profile collection. Both require 
system support comparable to that described in this paper. 




your computer^ 



Figur e L The Structure of Morph. 

The major coniponenis of Morph are represented 
schematically in Figure 1 . Optimizations in Morph are applied 
without tlie use of source code. To achieve this goal we posit 
an application distribution format that includes a compact 
representation of the compiler intermediate form. Compiler 
optimizations are implemented as SUIF passes which apply 
transformations to this intermediate form. An important 
aspect of our design is that executable modules are machine 
and OS specific. Though we are aware of prior work on 
machine and operating system-neutral program distribution 
formats [OSF93, Col95], we have chosen not to attempt 
similar functionality in Morph, focusing instead on the 
problem of using profile-driven optimizations to improve 
performance. 

The key to providing a viable framework for host-specific 
optimization is to define a partitioning of the compilation 
process that supports efficient and effective retargeting. The 
link between the two phases of compiiaiicn is the Morph 
executable, an executable that includes the supplementary 
information required for , the second, host-specific stage of 
compilation. Conceptually, a Morph executable consists of 
two parts. The first part is equivalent to executables in today's 
systems. This part could be directly executed on the target 
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hardware. The second part, specific to the Morph system, 
contains the extra information necessary to re -optimize the 
executable. It includes information that allows recovery of the 
program intermediate form, as well as the additional Morph 
annotations required to support robust and genera] executable 
rewriting. In our prototype, we maintain these two 
components in separate files, one for the executable program, 
and another containing the complete intermediate form 
representation of the program. 

Morph has five major software components: the Morph Back- 
end, the Morph Manager, the Morph Editor, PostMorph, and 
the Morph Monitor. The Morph Back-end produces 
executable programs and shared libraries thai include the 
annotations Morph needs to support efficient late 
optimization. The Morph Manager is a user-level daemon that 
provides off-line profile management as well as profile 
analysis to decide when to invoke the Morph Editor and 
PostMorph. The Morph Editor. implements the optimizations 
provided by Morph. The current version of cur Editor 
supports three profile-based optimizations. Two of the 
optimizations, procedure layout and 'fluff removal", are 
designed to improve locality in the instruction reference 
stream [PH90]. The third is a basic block ordering 
optimization that improves both branch prediction and 
instruction reference locality [PH90. YJK973, The Morph 
Monitor collects profile information for use by the Morph 
Editor. The Monitor collects profile samples with very low 
overhead, leaving subsequent processing to the Morph 
Manager PostMorph provides a means for extending the life 
of legacy exccutables by inferring Morph annotations through 
sutic and dynamic analysis so they can be retargeted by the 
Morph Editor. 

3.2 On-line Profile Collection: The Morph 
M nitor *^ 

Profile collection is implemented as a two-phase process: on- 
line sample collection in the Morph Monitor, followed by off- 
line profile analysis and management in the Morph Manager. 
This separation allows us to minimize the on-line overhead of 
profile collection by deferring as much work as possible until 
off-line processing. In this section we describe the design of 
the Morph Monitor and some of the interesting problems we 
encountered in its implementation. 

One key goal of the Morph Monitor is low profiling overhead. 
Our monitor achieves this goal by using statistical sampling 
of program activity, rather than collecting a complete profile. 
In most current systems, profile information is typically 
obtained by executing an instrumented version of the 
application that generates profile information as a side-effect 
of program execution fSmi91, SE94, LS95). A 
straightforward approach for automatic profile collection 
would be periodic use of a higher-overhead profiling system, 
such as an instrumentation-based profiler. Although this 
approach has the advantage that it is well understood, the 
performance overhead incurred during profiled runs of 
application would be substantial and in many cases would be 
perceptible by users. Also, if profile collection were to occur 
only part of the time, it would be difficult to insure that the 
resulting profiles were representative of typical program use. 
By collecting profile information continuously for all activity 




on the system, we assure that the profiles accurately reflea 
how the software is used. 

The Monitor is implemented as a pseudo-device and a small 
number of local modifications to the Digiul UNIX 4.0 kernel. 
Profile samples are collected using a modified version of the 
clock interrupt routine, hardclock(U which records a program 
counter (PC) sample to a sample buffer on each clock 
interrupt. Idle- loop activity is automatically recognized and 
eliminated during profile collection. Several additional kernel 
modifications provide supplementary information needed to 
interpret the profile samples. We modified the exec() system 
call to provide information about the initialization of address 
spaces and to selectively enable and disable sampling. We 
also fixxlined the mmapO system call, which is used by the 
user-level runtime to map executable files and shared libraries 
into a user address space. We modified the exit() system call 
to log process termination events. This makes it possible to 
identify when an address space has terminated and the 
corresponding mapping information can be discarded. We 
also modified the kernel lb log context switch information so 
that we can interpret each PC sample in the context of the 
address space in which it was generated. 

The kernel sample buffer is statically allocated and is 256 KB 
in our current system. Periodically, the contents are 
transferred to the Morph Manager using a UNfDC pseudo- 
device. With 8 bytes per sample and a clock interrupt rate of 
1024 Hz, this allows for approximately 30 seconds of profile 
samples between daemon invocations. The sample buffer is 
organized as a ring buffer with sample collection continuing 
during the dump of the sample buffer to disk. 

Our current monitor presents a number of opportunities for 
tuning which we have not yet exploited. These include 
reducing the number of bytes required to store a profile 
sample and adjusting the tradeoff between sample buffer size 
and the frequency of writing samples to disk. The overhead of 
our current profiler satisfies our main subjective performance 
goal: it is unnoticeable. Measurements of our current Monitor 
implementation show the overhead of profiling is typically 
only about 0.1-0.3% (Section 4.3). Given this level of 
performance, we have chosen to defer further work on 
reducing profiling overhead and focus on the other 
functionality required for automatic optimization. 

A potential pitfall of sampling based on clock interrupts is 
that activity synchronized with clock interrupts might be 
missed. This problem could be avoided by introducing 
randomness into the sampling interval, using an intenupt 
based on a countdown timer instead of (or in addition to) 
sampling during clock interrupts. This scheme is used by the 
continuous profiler in DCPI [ABD97}. Many popular 
processors provide appropriate hardware support including 
the Digital Alpha and the Intel Pentium Pro. Using a version 
of the Morph Monitor that supports random sampling 
interval 5, we determined that sampling on clock interrupts is 
appropriate for the benchmarks used in this paper. 

3.3 Off-line Profile Processing: The M rph 
Manager 

The Morph Manager is implemented as a user-level daemon 
that periodically reads and processes the raw profile samples 
generated by the Monitor. Whereas the Monitor is responsible 
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tor on-iinc protilc sample colleciion. the Man3|CT performs 
off-line profile management and processing. The goals of this 
processing are twofold: to- decide when to invoice an 
optimization and to .transform the raw.* PC samples into a 
compact form that can be used directly by optimizations. In 
our work to date we have focused on transforming profile data 
for use by optimizations. 

From the point of view of optimization, applications art 
composed of multiple modules, typically with one executable 
flic and multiple dynamically linked libraries (DLLs). Each 
use of a module results in some number (possibly zero) of 
Monitor samples. We use the term sample set to refer to the 
PC samples resulting from activity in a single module for a 
single process. A single sample set may not include enough 
information to achieve the full potential benefit of a given 
optimization. This makes it desirable to collect and combine 
multiple sample sets into a single composite profile for the 
module. 

The Manager identifies executable modules in the profile 
sample log by their path-name in the file system. If the 
corresponding file is updated for some other reason than rc- 
optimization (for example, a program update), the system 
needs to recognize this event, so it»at profile samples can be 
interpreted in the correct conicxt. A straightforward scheme 
for implementing this is to discard profile samples collected 
during a period in which the corresponding module was 
updated. A more sophisticated scheme could use a 
cryptographically secure checksum of the module, that would 
be recorded in the profile log and checked to verify the 
correspondence between the module that generated the profile 
samples and the module currently on disk. We expect that the 
simple scheme will be adequate in most situations. 
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Figure 2: Sample scaling for two basic blocks. 



In the conversion from profile sample sets to a basic block 
profile, each PC sample is a wirjicss for the basic block 
containing it. Morph provides the information needed to 
identify basic block boundaries in terms of instruction 
addresses and to map these address ranges to basic blocks in 
the intermediate form representation of the executable 



module. This mapping information is required for the use of 
profiles based on PC values, such as those obtained through 
statisdcal sampling. 

The PC samples must be scaled when incrementing the 
counters for basic blocks. To understand the need for scaling, 
consider two basic blocks, one containing three instructions, 
one containing six Instructions, and both of which are 
executed the same number of times during a process. Figure 2 
illustrates this example. When processing profile samples, the 
Manager assumes two instructions that arc executed the same 
number of times have the same probability of being sampled. 
If we increment the pcr-basic block counter by one for each 
sample, the resulting profile would indicate (incorrectly) that 
block B was executed twice as often as block A. Scaling the 
increment by the inverse of the basic block length gives better 
correlation between the weights in the profile and the number 
of times the basic block was executed. Differences in the 
cycles required to execute individual instructions also affect 
the profiles. We will discuss these effects shortly. 

After transforming PC samples into a basic block profile, we 
combine the individual profiles from each run to generate a 
single basic block profile. The Manager sunu the counts from 
each program execution for each basic block to create an 
aggregate profile. This method of combining sample sets 
models (he actual usage of the application, as each run of the 
application is represented in the cumulative profile in 
proportion to the amount of execution time it required. During 
this conversion, profile samples for modules that have been 
modified or deleted are eliminated. 

The combining of profile data is a general problem for 
profile -based optimization and is not unique to our 
methodology. In conventional (non- automatic) profile-based 
opdmization, complete profiles are collected for a number of 
scripted training input sets. This is in contrast to continuous 
profiling of all activity as with Morph. Fisher and 
Freudenberger fFF92I evaluate three techniques for 
combining multiple branch profiles: combining profiles 
directly, normalizing to give each test case the same weight in 
the composite profile, and polling'. The Morph Manager uses 
direct combining, such that profiles contribute to the 
composite profile in proportion to their contribution to overall 
activity. We believe that direct combining is appropriate for a 
continuotjs monitor, even though normalization can give 
better results for scripted inputs [FF921. 

When a module is optimized, the profile information for the 
nwdule must also be transformed, to make it meaningful in 
the context of the new executable module. As our profiles are 
stored in terms of basic blocks in the program's intermediate 
form representation, this transformation requires a function 
that maps basic blocks in the old intermediate representation 
to basic blocks in the new intermediate representation. In 
Morph this mapping function must be generated as a by- 
product of optimization. 

Our profiles tend to give more weight to basic blocks with 
higher latency and higher average cycles . per instruction 
(CPI), as they have a higher relative probability of being 
sampled: we call these nme-basad profitss. In contrast. 



' Fisher and Freudenberger found that polling gave inferior rtrsutts. This 
technique will not be discussed funher. 
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profiles generated by insimmenicd program execution give 
equaj weight to all instnictions, regardless of their relative 
latency; we refer to these as frequency-based profiles. 
Because of this difference, time-based profiles and frequency- 
based profiles of the same activity arc not identical. We could 
attempt to adjust the time base in our statistical profiles to 
make them more similar to a frequency-based profile. 
However, our experience to date suggests that neither time 
nor frequency based profiles arc clearly superior. [GWZ97] 

3.4 The Morph Editor 

The Morph Editor applies optimizations using profiles from 
the Morph Monitor. The Editor is currently implemented as a 
composition of SUIF compiler passes which convert a 
program module in a low-SUIF intermediate form to 
optimized assembly language. For this paper, the Editor 
performed three profile-driven code layout optimizations. The 
input to the Editor has already been highly optimized using a 
set of typical scalar optimizations including dead -code 
removal, as well as machine- specific optimizations such as 
register allocation. The rest of this section provides a brief 
description of the optimizations implemented by the Editor. 
All the optimizations require machine -specific information 
and benefit from profile information, including basic block 
execution counts and control flow graph edge frequency 
counts. The Morph Monitor directly generates basic block 
counts, and the Editor analyzes these counts to synthesize the 
edge frequencies. The counts and frequencies do not have to 
be exact since the information is used only to direct 
optimization toward the most popular parts of the program. 
Since these optimizations make some paths through the 
application run faster at the expense of others, they perform 
better when profile information can be used in place of simple 
global heuristics (e.g. assuming forward branches taken and 
backward branches not-taken). 

The three code layout optimizations implemented in the 
Editor are branch alignment, fluff removal, and procedure 
layout. AH are based on previous work by Penis and Hansen 
(PH90]. The first optimization is branch alignment This intra- 
procedural optimization reorders the program basic blocks to 
reduce control penalties, branch mispredictions and 
misfetches. and to improve instruction cache locality. We 
implement a greedy version of this technique, described by 
Young ei al. [YJK97]. The algorithm considers the edge 
frequencies in order of decreasing weight and attempts to 
place the blocks al the ends of an edge adjacent to each other 
so that the control penalties arc minimized. Fluff removal 
follows branch alignment. This optimization uses the basic 
block counts to identify blocks of code (such as error handling 
code) that were not executed during the profiled run. Moving 
these blocks to the end of the text segment improves the 
density (spatial locality) of the remaining code. The relocation 
of fluff code requires additional jumps to insure that the code 
is still reachable*. 

Once we have compacted the executed portion of each 
procedure, we apply a procedure layout optimization to find a 




placement of the program profcedures in memory that 
minimizes the penalty cycles due to instruction cache conflict 
and capacity misses: We place all of the non-fluff procedure 
components before placing all of the fluff components. In 
particular, our implementation consimcts a weighted call 
graph from the block counts and uses it to identify procedure 
conflicts that incur the largest cache penalties. Our greedy 
algorithm attempts to minimize penalties by using a closest- 
is-best heuristic: if two procedures call each other frequently, 
the algorithm attempts to place them near each other in the 
code segment 

4. Experimental Results 

This section describes our experiments to evaluate automatic 
profile-based optimization under Morph. After describing the 
experimental system and worWoads, we present two sets of 
results. One set of experiments (Section 4.2) documents the 
effectiveness of profile-based optimization under Morph. In 
the second set of experiments (Section 4.3), we quantify and 
analyze the overheads in Morph. 

Overall our results show that the optimizations applied by 
Morph achieve substantial performance improvements for 
many of our test wortioads. Though the focus of this paper is 
not to explore the effectiveness of such optimizations, these 
results are important as justification for providing support for 
automatic optimization as part of an operating system. 
However, readers should keep in mind that the optimizations 
used in Morph are well understood by the compiler 
community, and the main criteria upon which Morph should 
be evaluated is its effectiveness as a framework for automatic 
optimization. 

4.1 Experimental DetaUs 

Our Morph prototype system used version 1 . 1 .2 of the 
Stanford SUIF compiler with Harvard Machine SUIF 
extensions (version 1.1.2). We ran our experinwnts on an 
AlphaStation 400 4/233. The system includes a S 12 KB 
second-level cache and 128 MB of main memory. Profile 
information was collected by sampling the program counter 
on every clock interrupt producing 1 024 samples per second. 

Our experimental system is based on Digital UNIX version 
4.0. The Morph Monitor is implemented as a set of 
modifications to the Digital UNIX kernel, with other Morph 
components mnning within the infrastructure provided by 
Digital UNIX. 



* There is clearly an inieraciion bciwecn insinjciion scheduling, branch 
alignmcni. -and flufr removal. Manapng this interaction is an area of on- 
going research in the compiler community. 



20 



DCCIO. <XP 77101BA_I_> 



V 



Benchmark 


Description 


Text 
(KB) 


public 
license 




file compression utility 


96 




video file ptaver 


248 J 


Digital 
UNIX 


sort 


data file sorting or mergingi. 
utility 


56 


5pec92 


espresso 


boolean function in!nimjzc*icn 


248 


Spec95 




oriental board Kame 


416 




ypeg 


JPEG inuge compression and 
decompression 


248 




ItSD 


LISP tanKuaee interpreter 


1 12 




perl 


Pert language intcrpi«ter 


488 




vortex 


object-oriented database 
transaction system 


800 



Table 1. Experimental Workloads. 



Benchmark 


Testing Input 


Training Input 1 


espresso 


tial.in from Spcc92 


seven other input files | 
from Spec92, repeated five 

ttlTlCS. 


go 


test input from Spec95 


ref and train inputs from 
Spec95 


gzip 


one 6.43MB file 


ten oth«r files with size 
between 0.47MB -1 1.7MB 


Upeg 


train input from Spec9S 


ref and test inputs from 
Spec95 


lisp 


test input from Spec95 
(S-queen problem) 


nef and trrJn inputs from 
Spec95 


rapegjiay 


one video file 
downloaded from the 
Internet 


ten other video files 
downloaded from the 
Internet 


peri 


test input {jumble.pt) 
from Spec9S 


ref input {pritnes.pl and 
scrabbLph from Spcc95 


sort 


sort a 0.75MB data file 


ten other sorting tosics 


vortex 


train input from Spcc95 


test input from Spcc95 


Table 2. Testing and Training inputs. 








% of total activity 


Time ( in seconds) 




Program 


app 


lib 


kernel 


test 


train 




espresso 


90.7 


8.4 


0.9 


11.3 


23.8 






99.7 


0.2 


0.1 


354.7 


1917.7 




Rzip 


97.5 


0.3 


2.2 


25.5 


184.7 






98.5 


1.3 


OA 


20.6 


1284.0 




liSD 


98.9 


0.9 


0.3 


17.8 


1009.5 




mpec_plav 


93.5 


5.6 


0.8 


18.7 


99.2 




perl 


83.5 


1 2.1 


4.4 


43.7 


799.5 




sort 


53.6 


41.8 


4.6 


13.4 


79.8 




vortex 


91.2 


8.3 


0.5 


61.6 


238.6 





Table 3. Application Time Distribution, This table gives 
testing and training times for the experimental v^orkloads. 
and a breakdown of training time among the application, 
shared libraries, and the operating system kernel. The time 
breakdown is based on profile samples from the benchmark 
training run. The execution time is measured using the 
system clock, which has 17 ms resolution. The lime 
distributions are computed from the profile samples; e.g. 
0.1% of the profile samples for go were PCs values from the 
kernel. 

We used a suite of nine workloads for this study. Table 1 
gives a description of each woricload, and Table 2 describes 
the training and testing input sets we used. Table 3 gives 
summary statistics for the workloads, including both 




execution times for the test inputs and total execution time for 
all training inputs. All experiments for this paper were run in 
single-user mode with a warm file system cache. Two factors 
limited our choice of workloads. First, any potential 
benchmark application had to be compiled with SUIF. As a 
research compiler, SUIF does not support the full set of 
language idioms supported by gcc or commercial C 
compilers, and makes less efficient use of CPU time and 
memory during compilation. As a result we were unable to 
compile many popular public domain UNIX applications. We 
also excluded applications that are I/O bound or spend most 
of their time in the kernel, as they will not benefit 
significantly from re -optimization. 




Figure 3. OptimizatiorB Results: This figure shows the 
reduction in execution time for programs optimized by 
Morph as a percentage of the execution time of the original 
program. AH execution times are the average of ten runs. The 
improvement in execution time for complete profiles is 
included for reference. 

4.2 Optimization Results 

Figure 3 shows optimization results for the nine test 
workloads. Optimization using profiles gathered by the 
Morph system improves application performance by up to 
27% depending on the application. The results labeled 
"statistical profiles" in Figure 3 were generated by combining 
profile sanr:ples from the training runs of a workload, 
generating an optimized executable using the Morph Editor, 
and then measuring the execution time of the optimized 
executable on the "test" input data set. Our experience with 
Morph suggest that minimal training is needed to obtain most 
of the benefit of optimization. For a detailed discussion of our 
findings, sec IGWZ97). The execution time improvements 
given in Figure 3 arc computed from the average of ten runs, 
with execution times measured using the Alpha cycle counter. 
The standard deviation for all workloads was less than 0.3% 
with the exception of go (1.8%). 




To compare peifonnance improvements for automatic 
optimization in Morph to conventional optimization 
techniques using complete profiles, Rgure 3 also includes a 
set of optimization results using complete profiles from the 
execution of instrumented workloads. Overall, the 
optimization results show that the potential benefit of profile- 
based optimization is significant and that automatic profiling 
and optimization with Morph can effectively achieve those 
benefits. 

4.3 Overhead 

We classify overhead in Morph according to the three steps in 
automatic optimization: on-line profile collection, off-line 
profile processing, and optimization. In this section, we 
evaluate the overhead of profile collection and profile 
processing. We do not discuss the overhead of the SUIF 
optimizations performed by the Morph Editor. SUtF is a 
research compiler, and is designed for flexibility rather than 
short compilation times. Ebcioglu and Altman [EA97] 
dcscribr a dynamic compilation system designed specifically 
for com^;iIation speed while still performing machine -specific 
optinii7 ions such as global instruction scheduling. They 
report ; t their compiler requires 15 times fewer instructions 
per gen. ated instruction than gcc. Similar techniques could 
be applied to the Morph Editor to make the cost of 
optimizanon low. 

4.3.1 Oii-iine Profiting Overhead 

Overhead from on-line profile collection comes from two 
sources: the time to record a PC sample into the in-memory 
profile buffer and the time required to periodically flush the 
profile buffer to disk. A problem that arises when attempting 
to measure overheads on the order of 0.1% for realistic 
workloads is that the overhead is smaller than the variation 
that can occur due to non-detemiinism in the system. A major 
source of such non-determinism is virtual-to-physical page 
mapping (KH92]. The standard Digiul UNIX kernel uses a 
bin-hopping policy for allocation of physical memory pages. 
Although this policy has some desirable performance 
characteristics [KH92J, it has the undesirable property of poor 
repeatability. For our overhead measurement experiments, we 
replaced the bin-hopping policy with page coloring [TDP^O], 
a deterministic policy, to improve the repeatability of our 
experiments. All results reported in this section use the page 
coloring policy. The problems and pitfalls of page mapping 
algorithms are well documented in the research literature 
rKH92, CB93, BCL94]. 

Table 4 gives the overhead of the on-line component of 
profiling for our test workloads. To distinguish tlie impact of 
profiling from memory system effects, we compare execution 
times for the Morph kernel with profiling and the Morph 
kernel with profile collection disabled. Disabling profile 
collection eliminates most of the instruction overhead for the 
Monitor and prevents profile sample generation, while 
preserving code displacements in kernel memory. We also 
present execution iimes for a standard Digital UNIX 4.0 
kernel. Note that the standard kernel can give slightly better 
or slightly worse performance than the Morph kernel with, 
profihng disabled. This is an indication of the impact of 




moving code within the kernel on execution time for the test 
workloads. 



Progrtnn 


DU 


MoTph-off 


Morph 


Overhead 


espresso 


12.051 


12.133 


12.155 


0.18% 




(0.001) 


(0.001) 


(0.003) 




gc 


279.724 


282.045 


282.498 


0.16% 




(0.193) 


(0.234) 


(0.170) 




grip 


25.275 


24.953 


24.981 


0.11% 




(0.010) 


(0.009) 


(0.007) 




Upeg 


20.770 


20.740 


20.770 


0.14% 




(0.005) 


(0.004) 


fO.0O6) 




lisp 


19.319 


19.296 


19.322 


0.13% 




(0.001) 


(0.001) 


(0.004) 




mpegj>lay 


19.262 


19.309 


19.341 


0.17% 




(0.001) 


(0.001) 


(0.004) 




peri 


46.977 


46.253 


46.357 


0.22% 




(0.043) 


(0.017) 


(0.020) 




sort 


13.591 


13.614 


13.624 


0.07% 




(0.010) 


(0.009) 


(0.010) 




vortex 


59.324 


59.840 


59.982 


0.24% 




(0.006) 


(0.014) 


(0.015) 





Program 


DU 


Morph'Ojf 


Morph 


Overhead 


strawmao 


86.38 


86.38 


86.45 


0.07% 




(0.00) 


(0.00) 


(0.00) 




stmwcian 


67.97 


68.38 


68.73 


0.50% 


512KB 


(O.OI) 


(0.00) 


(0.01) 




straw man 


82.11 


82.17 


82.31 


0.17% 


1MB 


(0.00) 


(0.00) 


(0.01) 





Table 5. On-line overhead for strawmao tests. This table 
shows execution limes and profiling overhead for the three 
str^wman tests: The experimental operating systems are the 
same as for Table 4. Note that the number of iieraiions 
pciformed by these usts is not the same. Strawman<lMS has 
a very high CPI (about 18) making it impractical and 
unnecessary to run a longer test. 

To quantify the overhead of profiling, we compare benchmark 
execution times for the Morph kernel with and without 
profiling- Table 4 shows that the overhead of profiling in 
Morph is very small, ranging from 0.07% to 0.24%. Although 
the absolute overhead is small, the relative difference between 
the highest iyortex) and lowest {sort) overheads is large. 
Referring to Table I, note that small overheads occur for 
programs with small text segments, and thai larger overheads 
occur for the programs with large text segments. Programs 



Table 4. On-l!sMe overhead for Morph contiDucHis profile 
collection. This table shows execution times and profiting 
overhead for our woikloads using test inputs and three 
different systems, the base Digital UNIX kernel (DU). the 
Morph kernel with profiling disabled (Morph-ofO. and the 
Morph kernel with profiling enabled (Morph). All systems 
used page-coloring for physical page selection. Times are the 
average of ten mm. in seconds, and were measured using the 
Alpha cycle counter. All measurements in this table arc for a 
warm file system cache. Standard deviations are given below 
times (in parentheses). Overhead is computed by comparing 
execution time with Morph monitoring (Moiph) to execution 
time using a Morph kernel with profiling disabled (Morph- 
off). 
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with small code working sets have relatively little chance of 
causing instruction cache conflicts with the Morph profiling 
code. For larger programs, it is more.t likely that a cache 
conflict between the application and the kernel profiling code 
will occur. 

To isolate these memory system effects and quantify the 
relationship between code woricing set and profiling overhead, 
wc constructed a set of artificial tests, strawman, strawman 
512KB, and strawman 1MB, Strawman is an eight-line C 
program consisting of a tight loop that increments an integer 
variable. The key features of this program are that it is 
compute bound with no I/O and no system calls, generates 
profile samples at the maximum rate, and makes minimal use 
of the memory system, requiring one line in the instruction 
cache and no data cache references. The trivial memory 
reference pattern of the strawman benchmark and the lack of 
system or library calls make it possible to isolate the CPU 
overhead of profiling. The results for the strawman test show 
that the lower bound for profiling overhead when samples are 
generated at the peak rate is about 0.07% (Table 5). 

Programs with larger working sets will tend to cause more 
cache conflicts with the kernel profiling code. The worst case 
occurs for a program that occupies the entire second-level 
cache but does not generate any self-conflict misses. For such 
a program, no misses occur when there is no kernel 
interference, but with kernel interference up to two misses can 
occur for each cache line of kernel code accessed, one to 
bring kernel code into the cache, and one to restore the 
displaced user code. We constnicted the strawman Si 2KB best 
to create this worst-case behavior. The profiling overhead for 
strawman 512KB is 0.50%, and this gives an estimate of 
worst-case overhead for our monitor. Our strawman 512KB 
test is not the absolute worst case for profiling overhead as it 
only interferes with the kernel in the instruction cache. An 
artificial test that interferes in a pessimal way for both 
instruction and data references is conceivable, but variability 
in kernel and user data layout makes it difficult to construct 

For large programs with self -conflict misses, kernel profiling 
activity will cause additional instruction cache misses for 
kernel references but will tend not to cause new user 
instruction cache misses. An extreme case for our direct- 
mapped 512KB cache is a program that loops over a 1MB 
block of code, such that each user instruction reference that 
crosses a cache line boundary generates a cache miss. We 
constructed the strawman 1MB test to create this behavior. As 
expected, the profiling overhead for this case is 0.17%, less 
than the overhead for the strawman 5 1 2KB case. 

To gain further intuition into sources of overhead for on-line 
monitoring, consider the two required activities. The first is 
the recording of PC samples during the clock interrupt 
routine. The Morph Monitor adds 72 instructions to each 
clock interrupt to record a PC sample in the sample buffer, or 
a total of approximately 74,000 instructions per second. The 
overhead of these instructions depends on their CPI. For 
example, for a CPI of three, the overhead would be about 
222,000 cycles, or approximately 0.10% for a 233 MHz 
machine. This is consistent with the overhead seen for 
strawman and for sort, the smallest benchmark. For 
applications with a large working set, the CPI of the profiling 
instructions will be higher. For a CPI of 10. the overhead 
estimate would be about 740,000 cycles, or approximately 




0.32% of the machine. This correlates well with the overheads 
observed for the test workloads. 

The other source of on-line sampling overhead is the writing 
of profile samples from the in-memory sample buffer to disk. 
This overhead depends on the amount of idle time, which 
does not generate samples, and on the frequency of logged 
events such as context switches and exec() system calls. With 
a 1024 Hz clock interrupt rate, profile samples are generated 
at a peak rate of approximately 8 KB per second. With our 
current monitor we have observed that 1-24% of the 
unprocessed profile information is for logged events, with the 
remainder occupied by profile samples. On days when the 
system is relatively busy and a large number of profile 
samples are generated, logged events occupy less than 12% of 
the total space. 

The Monitor flushes the sample buffer to disk every 10 
seconds. Assuming conservatively that the profile samples arc 
generated at the peak rate of 8 KB per second and that the 
buffer contains 75% profile samples and 25% logged events, 
the writing of t>ie profile buffer to disk will generate 1 10 KB 
of disk write operations every ten seconds. The samples are 
copied once on their way to disk, and diis copy activity 
dominates the CPU activity required for the write. This 
generates 110 KB of copy activity for each ten seconds 
interval in our pessimistic scenario. Assuming S-byte writes 
and a pessimistic ten cycles per write, this would add 
approximately 140.000 cycles of memory latency to each ten 
seconds of computation, or 0.006% overhead. The logging of 
profile samples to disk also consumes some amount of disk 
bandwidth, although at 1 1 KB per second the bandwidth is 
insignificant when compared to the write bandwidth of a 
modem high-performance disk. As the writes arc contiguous 
and the fr&qucncy of writes is low, the seek overhead required 
by writes should not be significant, except for a badly 
fragmented disk. 

Overall, these overbed estimates are consistent with the on- 
line profiling overheads observed for our test workloads. We 
conclude that profiling overhead in Morph is very small and 
has a negligible impact on overall performance. 

43^ Off-line Profile Processing Overhead 

Processing of raw PC samples occurs off-line and is deferred 
until off-peak hours to avoid interfering with interactive users. 
For our current version of the Manager, profile samples can 
be processed at a rate of 60 MB per minute. Again using the 
pessimistic assumption of 75% profile samples and 25% 
logging entries, profile samples are produced at a peak rate of 
about 640 KB per minute. Given that there arc 1,440 minutes 
in a day. 900 MB (uncompressed) is an upper bound on the 
amount of profile samples and logging information produced 
per day. It is not necessary to provide this much staging space 
for profile samples. In our experience, the amount of profile 
samples genenitfid in a day is typically tens of megabytes, and 
profile staging space can be limited to this level. To control 
the amount cf disk space required, the system can simply 
discard sampleT. IFor systems with sufficient idle time, use of 
a low-priority daemon to post- process Uie samples more 
frequently can also reduce staging space n;quii-ements. 

Long-term profile storage occurs in the form of basic block 
counts. After being computed from the raw samples, the basic 



23 



CXXID <XP 7710ieA_l_> 




block profiles of individual programs are integrated into the 
intermediate form representation of the program. Table 6 lists 
the disk space occupied by long-term storage of basic block 
profiles for our test application suite. The sizes given arc for 
uncompressed profiles. Compression further would reduce 
profile storage requirements. The profile size is roughly 
proportional to the size of the program code segment. As the 
number of basic blocks for a program is stable, the profile 
sizes do not change over time. For our benchmark suite, the 
space requirements for storage of basic block profiles ranges 
from 29% to 82% of the program text size. 



Program 


Executable 


Text 


ProfiJe 


% 


sort 


80 


56 


46 


82% 


gzip 


152 


96 


49 


51% 


lisp 


216 


112 


94 


84% 


rapeRjilay 


328 


248 


73 


29% 


espresso 


376 


248 


124 


50% 




400 


248 


121 


49% 




608 


416 


225 


54% 


perl 


664 


488 


367 


75% 


v(H-tez 


1280 


800 


440 


55% 



Table 6. Long-term Profile Storage. This table shows 
executable, text, and profile sizes in kilobytes for our test 
workloads. The right-most colunui shows profile size as a 
percentage of text segmcni size. 



The space requirement for long-term profile storage will tend 
to increase in more sophisticated versions of Morph. 
Heuristics to decide when to re-optimize a program will 
require history information to detect when changes in 
program usage patterns justify re-optimization. Also, although 
our current set of optimizations is effective with basic block 
profiles, new optimizations or refinements to our current 
opiimizatiorw may require different types of profiles. Further 
research will be requirisd to balance the tradeoff between 
profile storage space and effective re-optimizaiion. 

5. Discussion 

Our work to date has demonstrated that automatic profiling 
and optimization is both efficient and effective. There are a 
number of issues that must be resolved to make the system 
practical for more widespread use. 

Our current system uses trivial heuristics to decide when to 
re -optimize (i.e. re-optimize when triggered manually, re- 
optimize when new profiles are available). We suspect that 
more sophisticated heuristics (i.e. re-optimize when program 
usage patterns change) could lead to substantial 
improvements in the system. These heuristics must achieve a 
balance between optimizing rarely, so that re-optimization 
occurs promptly and within the budget of available idle CPU 
cycles, and optimizing frequently, to maximize the benefit 
from profile information. . In our continuing work we are 
developing the profile analysts techniques needed to 
implement improved heuristics. 

Our current system uses simple methods for combining 
profile information. The continuous generation of profiling 




data raises th? issue ot how to age pronie dau as it is 
collected over an extended period of time. There are a number 
of simple approaches, 'this is a part of our continuing work. 

Before, automatic optimization can be incorporated into 
mainstream, computing platforms, an infrastructure must be 
put into place that allows intermediaie form representations of 
program modules to be recovered from the application 
distribution. Thi£ issue must be fully resolved before Morph- 
style automatic optimization becomes widely used. From a 
purely technical point of view, the most straightforward 
solution would be to augment the shipping versions of 
executables with an intermediate form representation of the 
program. This solution has major practical drawbacks, 
however, as it requires new software standards and standards 
compliance across the spftware industry. Another possibility 
is to develop new ways of acquiring the additional 
information needed to derive intermediate form program 
representations from current executables. Digital's FX!32 
[CH97, Rub96) provides an example of progress in this 
approach. Our PostMorph tool represents another alternative. 

One of the greatest barriers to obtaining conclusive results for 
Morph is the lack of a compelling suite of interactive 
benchmarks. The ideal test for our system would be a suite of 
large, interactive benchmarks in a format that permits re- 
optimization. Unfoitunaiely. source or intermediate form 
representations for such berK;h marks are not commonly 
available. Recent developments in late-code optimization 
(CL96. Rub96, RVL97] offer some promise in addressing 
these issues. 

6. Conclusions 

This paper describes a system for continuous low-overhead 
profile collection designed to meet the unique requirements of 
automatic optimization. We achieve these low overheads 
through statistical sampling of all system activity and by 
deferring processing of profile samples for off-line 
processing. Our results show that continuous profiling can be 
achieved with very low overhead and that the resulting 
profiles are effective when applied in profile-based 
optimization. As high performance processors become 
increasingly dependent on super-scalar issue and deep 
memory hierarchies for sustained performance, the 
importance of these optimizations will increase. 

The prospect of practical operating system support for 
automatic optimization introduces a host of opportunities for 
new research, both in systems and in compilation. Our Morph 
testbed system demonstrates how a practical framework for 
automatic optimization can be implemented for a UNIX 
system, and provides a foundation for further research in this 
area. 

Acknowledgements 

Many , members of the Harvard Machine SUEF project 
contributed time and brainpower in maintaining the machine- 
specific code optimizations used in this paper. We would like 
to specifically acknowledge Gang Chen and Cliff Young, who 
provided support for the optimizations in the Morph Editor. 



24 



'OCID: <XP 771018A_.I_> 



m 



we would also like to thank the anonymous reviewers tor 
their useful comments. We would especially lilf.e to thank our 
shepherd Jeff Mogul who provided valuable feedback during 
the last stages of the preparation of this paper 

The Morph project is supported by gifts from Intel and Digital 
Equipment Coiix>raiioii. Bmd Chen is supported in part by an 
NSF CAREER Award. CCR-950I365. Mike Smith is 
supported in part by DARPA grant number NDA904-97-C- 
0225 and by a National Science Foundation Young 
Investigator Award, grant number CCR-9457779. 

Bibliography 

[ABD97] J. Anderson. L. M. Berc. J. Dean. S. Ghcmawat, M. 

R. Henzinger. S. Leung. R. L. Sites. M. T. 
Vandcvoorde. C. A. Waldspurger. and W. E. Weihl. 
"Continuous Profiling; Where Have All the Cycles 
Gone?" In Proceedings of the 16th ACM Symposium 
of Operating Systems Principles (in this volume). 
October 1997. See also 

http://www.research.digital.conVSRC/dcpi/ 

IBCL94] B. Bershad. J. B. Chen. D. Lee. and T. Ronvci. 

"Avoiding Conflict Misses Dynannicaily in Large 
Direct- Mapped Caches." In Proceedings of the Sixth 
International Conference on Architectural Support 
for Programming Languages and Operating Systems. 
ACM. pages 158-170. October 1994. 

[CB93] J. B. Chen and B. Bershad. 'The Impact of Operating 
System Structure on Memory System Performance." 
In Proceedings of the 14ih ACM Symposium on 
Operating System Principles, ACM, pages 120-133. 
December 1993. 

[CH97] A. Chemoff and R. Hookway, "DIGITAL FX!32- 
Running 32-Bit x86 Applications on Alpha NT." To 
appear in Proceedings of the VSENIX Windows NT 
Workshop, USENIX Associauon. Berkeley CA, 
August 1997. 

[Chr96] P. Christy. "IA-64 and Merced— What and Why." In 
Microprocessor Report, pages 17-19. 30 December 
1996. 

[CL961 R. Cohn and G. Lowney. "Hot Cold Optimization of 
Large Windows NT Applications." In Proceedings of 
the 29th Annual International Symposium on 
Microarchitecture, IEEE, pages 80-89. December 
1996. 

[CMH96] T. Conte. K. Menczes. and M. A. Kirsch, "Acciiraie 
and Practical Pro file- Driven Compikuon Using the 
Profile Buffer." In Proceedings of the 29th Antiual 
International Symposium on Microarchiteciure, 
IEEE, pages 201-21 1. December 1996. 

[Col951 Colusa Software. "Omni ware, A Universal Substrate 
for Mobile Code." White paper from Colusa 
Software, http://www.colfi50.cGm/ • 

(EA97] K. Ebcioglu and E. Altman. "DAISY: Dynamic 
Compilation for 100% Archireciural Compatibility." 
In Proceedings of the 24th Afw.ual Imemational 
Symposium on Computer Architecture, .ACM. , pages 
26-37, June 1997. 



Scheduli 



(Fis971 J. Fisher. Trace Scheduling: A Technique for Global 
Microcode Compaction." In IEEE Transactions on 
Computers, Volume 30, Number 7. pages 478-490. 
July 1981. 

IFF921 J. Fisher and S. Freudenberger, "Predicting 
Conditional Branches from Previous Runs of a 
Program." In Proceedings of the Fifth International 
Conference on Architectural Support for 
Programming Languages and Operating Systems. 
ACM. pages 85-95. October 1992. 

[FK961 M. Franz and T. Kistler. '*Sltm Binaries." Technical 
Report No. 96-24. Department of Information and 
Computer Science. University of California. Irvine. 
June 1996. 

[Goo971 D. Goodwin. "Inter procedural Dataflow Analysis in 
an Executable Optimizer." In Proceedings of the 
1997 ACM SIC PLAN Conference on Programming 
Language Design and Implementation, ACM, pages 
122-133. November 1997. 

(Gwe96] L Gwcnnap, "Bringing Parallelism Out of the 
Closet." In Microprocessor Report, pages 14-15. 9 
December 1996. 

(GWZ97] N. Cloy. Z. Wang. X. Zhang. J. B. Chen, and M. D. 

Smith, "Optimization with Statistical Profiles." 
Technical Report TR-02-97. Center for Reliable 
Computing. Division of Engineering and Applied 
Sciences. Harvard University. April 1997. 

[KH92] R. Kcssler and M. Hill. "Page Placement Algorithms 
for Large Real -Index Caches." In ACM Transactions 
on Computer Systems. Volume 1 0. Number 4, 
pagcs:338-359. November 1992. 

ILRW911 M. Lam, E. Rothbcrg. and M. Wolf, "The Cache 
Performance and Opumizaiions of Blocked 
Algorithms." In Proceedings of the Fourth 
Imemational Conference on Architectural Support 
for Programming Languages and Operating Systems, 
ACM, page^ 63-74. April 1991. 

ILS951 J. Larus and E. Schnarr, "EEL: Machine Independent 
Executable." In Proceedings of the ACM SICPLAN 
'95 Conference on Programming Language Design 
and Implememaiion, ACM. pages 291-300. June 
1995. 

Open Software Foundation Research Institute. ANDF 
Collected Papers. Volume IV. Open Software 
Foundation. December 1993. See also 
http://www.opcngroup.org/RI/andf/ 

K. Pettis and R. Hansen. "Profile Guided Code 
Positioning." In Proceedings of the ACM SICPLAN 
'90 Conference on Programming Language Design 
and Implementation, ACM, pages 16-27, June 1990. 

N. Rubin, "FX!32." Work-in-progress talk at the 
Seventh International Conference an Architectural 
Support for Pk'ogramming Ixinguages and Operating 
SysttTffj;. Octohci- 1996. Sec also 
htsp://wwvv.digital.conv'info/semiconductor/amt/fx32/ 

fx.himi 



[OSF931 



[PH90] 



[Rub96j 



25 



XXID;<XP 771018A I > 



[RVL971 T. Romcr. G. Voelker. D. Lee. A. Wolman, W. 

Wong. B. Bershad, H. Levy, and J. B. Chen. "Etch, 
an I nstm mentation and Optimization tool for Win32 
Programs." To appear in Proceedings of the USENDC 
Windows NT Workshop, USENIX Associauon. 
August 1997. See also 

hup7/www.cs. washington.edu/homes/bcrshad/Etch/ 

[SE94J A. Srivastava and A. Eustace, "ATOM: A System for 
Building Customized Program Analysis Tools." (n 
Proceedings of the ACM SICPLAN '94 Conference 
on Programming Language Design and 
Implementation, ACM. pages 196-205. June 1994. 
Sec also Research Report 94/2, Western Research 
Laboratory, Digital Equipment Corporation. 

(Smi9lJ M. Smith, 'Tracing with Pixie." Technical Report 
CSL-TR-9 1-497. Computer Systems Laboratory. 
Stanford University. Stanford. CA, USA, November 
1991. 

tSmi961 M. Smith. "Extending SUIF for Machine-dependent 
Optimizations." In Proceedings of the First SUIF 
Compiler Workshop, Stanford, CA. pages 14-25, 
January 1996. 

(SUIF94J Stanford SUIF Compiler Group. "SUIF: A 
Parallelizing and Optimizing Research Compiler." 
Technical Report CSL-TR-94-620, Computer 
Systems Laboratory. Stanford University. May 1994. 

[SW92) A. Srivastava and D. Wall. "A Practical System for 
Iniermodulc Code Optimization at Link-Time." In 
Journal of Programming Languages. Volume 1, 
Number 1. pages 1-18, March 1993. See also 
Research Report 92/6, Western Research Laboratory, 
Digital Equipment Corporation. 

(TDF90J G. Taylor. P. Davies. and M. Farmwald, "The TLB 
Slice - A Low-Cost High-Speed Address Translation 
Mechanism." In Proceedings of the 17^ Annual 
International Symposium on Computer Architecture, 
ACM, pages 355-363, 1990. 

(Wal921 D. Wall, "Systems for Late Code Modification." In 
Code Generation •• Concepts, Toots. Techniques. 
Springer- Verlag. pages 275-293. 1992. 

fWP87J D. Wall and M. Powell. "The Mahler Experience: 
Using an Intermediate Language as the Machine 
Description." In Second International Symposium on 
Architectural Support for Programming Languages 
and Operating Systems, ACM. April 1987. See also 
Research Repon 87/1 , Wcstem Research Laboratory, 
Digital Equipment Corporation. 

IYJK97J C. Young. D. Johnson, D. Karger. and M. Smith, 
"Near-optimal Intraproccdural Branch Alignment." 
To appear in Proceedings of the ACM SICPLAN '97 
Conference on Programming Language Design and 
Implementation, ACM, pages 183-192. June 1997. 



26 



XXID; <XP 77101BA_I_> 



• 1» 

XP-01 0334956 



A Hardware-Driven Profiling Scheme for Identifying Program Hot 
Spots to Support Runtime Optimization 

Matthew C. Merten Andrew R. Trick Christopher N. George John C. Gyllenhaal Wen-mei W. Hwu 

Center for Reliable and High-Performance Computing 

Department of Electrical and Computer Engineering «rs ' f\y ~t^r~ >c=cu^ 
University of Illinois ^ ^^fT^ 
Urbana, XL 61801 P.* lB>fe- ] M> 
Email: {merten, atrick, c-george, gyllen, hwu}@crhc.uiuc.edu \ ^ 



Abstract 

This paper presents a novel hardware-based approach 
for identifying, profiling, and monitoring hot spots 
in order to support runtime optimization of general- 
purpose programs. The proposed approach consists of 
a set of tightly coupled hardware tables and control 
logic modules that are placed in the retirement stage 
of a processor pipeline removed from the critical path. 
The features of the proposed design include rapid de- 
tection of program hot spots after changes in execu- 
tion behavior, runtime-tunable selection criteria for hot 
spot detection, and negligible overhead during appli- 
cation execution. Experiments using several SPEC95 
benchmarks, as well as several large WindowsNT ap- 
plications, demonstrate the promise of the proposed de- 
sign. 

1 Introduction 

Optimizing compilers can gain significant perfor- 
mance benefits by performing code transformations 
based on a. program's runtime profile. Traditionally, 
profiles are collected by running an instrumented ver- 
sion of the executable. However, because this profiling 
technique incurs a large overhead, applications are only 
profiled prior to distribution on a set of sample inputs. 
Consequently, they cannot be adaptively optimized in 
order to account for changes in program behavior or to 
take advantage of variations in the production system. 
More recently, low-overhead methods of profiling have 
been developed based on statistical sampling [1] [2] 
[3) [4]. The basic approach, however, remains the same: 
profile information for the program's entire execution 
is averaged into a large database and later fed back 
into a static compiler. This approach is undesirable 



for three reasons. 1) The entire profile of each appli- 
cation must be continuously maintained at runtime oh 
the production system, 2) The profile represents only 
average behavior across an extended period of time. 3) 
A .significant length of time may pass before variations 
in the program's behavior are detected. This paper ad- 
dresses these problems by presenting a new method for 
rapidly, accurately, and transparently collecting profile 
information with minimal runtime overhead. The pro- 
posed hardware approach provides a strong foundation 
for a runtime optimizing system. 

Although static optimizations are essential to ap- 
plication performance, additional opportunities can be 
exploited by continuously profiling and reoptimizing 
the code. For instance, many types of optimizations 
cannot be performed without more specific runtime in- 
formation. These include optimizing code based on 
value invariance [5] [6] and inlining dynamicaUy linked 
Ubrary functions [7]. The traditional, static approach 
also has the disadvantage that aggressive optimization 
can only be applied selectively, based on average profile 
weights and other criteria [8] [9]. Optimizations that 
cause excessive code growth, for example, must be ap- 
plied conservatively by the static compiler [10]. How- 
ever, a runtime optimizer can more aggressively apply 
these optimizations by targeting small regions of the 
prograjn that represent the critical execution path at a 
particular time. To support targeted optimization, our 
runtime profiler extends the traditional role of profilers 
and captures temporal information as well as report- 
ing the relative importance of basic blocks. The profiler 
identifies code that can be optimized quickly compared 
with amount of execution time that will subsequently 
be spent in code. Additionally, code that yields a sig- 
nificant short-term benefit is given priority since a run- 
time optimizer may not have enough memory at its 
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disposal to retain all optimized code indefinitely. 

Wc have observed that many applications exhibit 
behavior conducive to runtime profiling and optimiza- 
tion. For example, program execution often occurs in 
distinct phases, where eacli phase consists of a set of 
code blocks that are executed with a high degree of 
temporal locality. When a collection of intensively ex- 
ecuted blocks also has a small static footprint, it rep- 
resents a highly favorable opportunity for runtime op- 
timization. We will refer to such sets of blocks and 
their corresponding periods of execution as hot spots. 
A runtime optimizer can take advantage of execution 
phases by isolating a group of hot spots that are active 
for each phase. Ideally, aggressively optimized code 
would be deployed early in the phase and the optimized 
code used until execution shifts to another phase. Op- 
timized hot spots that are no longer active may then 
be discarded, if necessary, to reclaim memory space for 
newly optimized code. 

Several common types of programs exhibit this exe- 
cution phase behavior [11], such as compilers, graphics 
packages, and scripting engines. Because these appli- 
cations typically implement a wide range of function- 
ality, it is often intractable to aggressively optimize 
for every possible runtime scenario. Yet, these pro- 
grams should benefit from runtime optimization be- 
cause they are divided into focused tasks, or phases of 
execution. Compilers, for example, allow the user to 
select a variety of optunizations and other behavior at 
runtime, and they will often perform multiple passes 
across the input, exercising different functionality each 
time- Graphics packages also allow a wide variety of 
tasks to be selected at runtime, each of which may 
perform an intensive transformation on the data. Fur- 
thermore, it is likely that these transformations can be 
highly tuned when the data set is available during opti- 
mization. A similar situation exists in many scripting 
engines because a sizeable number of high-level com- 
mands arc available that perform intensive operatioxis 
on the data. Yet, for any single script, only a small 
selection of these routines may be utilized. 

A concrete example of this behavior can be seen in 
134.perl running the jumble training input from the 
SPEC95 benchmark suite. As shown in Figure 1, this 
benchmark contains three primary, distmct phases of 
execution with one hot spot per. phase. Hot spot 1 
runs for 72 million instructions, hot spot 2 for 1.35 
billion, and hot spot 3 for 200 miUion. The first hot 
spot consists of reading in a dictionary and storing it 
in an internal data structure. The second hot spot 
processes each word in the dictionary, and the third 
scrambles a selected set of words in the dictioniiry. 

The second of these hot spots serves as an excel- 
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Figure 1. Important branches executed in 
each execution sample for 134.perl. Each 
data point represents a branch that executed 
at least 40 times within the sample duration 
of 10,000 branches. The sample period is 
2,000,000 branches. 

lent example of why runtime optimization is needed. 
The input script causes perl to call split which breaks 
up an input word into individual letters, then to call 
sort which sorts those letters. The first function, split, 
calls a complicated regular expression matcher with an 
empty regular expression. This region of code would 
benefit from partial inlining and code layout, followed 
by classical optimization of the few exercised blocks in 
the regular expression matcher and split. A static com- 
piler could perform this optimization, but the larger 
code size and compile time would not benefit most in- 
put scripts. The second function, sort, calls the Hbrary 
function qsort which then calls a perl-specific compar- 
ison function. Less than half of the code in the com- 
parison function is ever executed because only single 
characters are actually sorted. This is another example 
where inlining is important because of the very frequent 
calls to the comparison function. However, a link time 
optimizer or runtime optimizer is needed to support in- 
lining across library and application boundaries. Fig- 
ure 2 shows a branch profile for a typical 10,000 branch 
sample of this second hot spot and clearly demonstrates 
that a small number of static branches account for the 
vast majority of the dynamic instances m the sample. 

Our strategy for detecting runtime optimization op- 
portunities consists of two stages. Initially, our profiler 
must detect hot spots as they emerge during execution. 
To do this, we track the execution frequency of code 
blocks across a relatively short period of time. When 
a set of blocks fit the characteristics of a hot spot, the 
profiler notifies the runtime optimizer. During the de- 
tection stage, arc weights are also accumulated for the 
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cache. Y is accessed three times followed by a long 
stream of X's. If X is not replaced in the cache by Y, 
then three I-cache misses will result » one for each ac- 
cess to Y. An LRU poUcy predicts that the next access 
will be Uke the previous one and will replace X with Y 
in order to save a miss. After the third Y, X will again 
be accessed, resulting in another miss. In this case, the 
LRU policy was able to reduce three misses down to 
two but the replacement of X was required. 

The BBS, however, must accurately record the most 
frequently executed blocks and their jjrofilcs rather 
them the most recently accessed. Therefore » imple- 
menting an LRU replacement policy and allowing a 
rare block to replace a frequently executed block is un- 
acceptable. Instead, branches that conflict with exist- 
ing BBB entries are simply discarded. The function of 
branch replacement is controlled by the <iforementioncd 
refresh timer. A more serious conflict occurs when two 
branches are important and map into the same cache 
location. As long as these conflicts ai-e relatively rare, 
the runtime optimizer can still recover by inferring the 
profile value of the missing branches. Using Kirchhoff's 
Fir.st Law and other heuristics [12], the input and out- 
put weights of the block can be estimated based on the 
branches that were accurately profiled. 

While the I-cache and STB coUect information 
about important instructions and branches, respec- 
tively, it is infeasible to combine the BBB with either 
of these two structures. In addition to the previously 
discussed replacement policy differences, the BTB and 
I-cache are both potentially on the critical path, and 
adding complexity in these two structures may ad- 
versely affect cycle time. In addition, since only true 
branch behavior should affect the BBB counters, incor- 
rectly speculated accesses must not be allowed to affect 
the BBB entries. Although the BBB counters in these 
structures could be updated after the branches are re- 
solved, the strict timing constraints and complex hard- 
ware required to interface with the BTB and I-cache 
wai rant the cost of storing the BBB entries separately. 
Thus, the BBB is best implemented as a stand-alone 
structure in the retirement phase because of its sim- 
plicity and inherent accuracy. 

2.1.2 Hot Spot Detection Counter 

Once candidate branches have been identified in the 
Branch Behavior Buffer, they must be monitored to de- 
termine whether the corresponding blocks may be con- 
sidered a hot spot and, thus, useful candidates for op^ 
timization. We have found two criteria that should be 
satisfied before a group of candidate blocks is dubbed a 
hot spot. First, the candidate blocks should be active 



for a specified minimum amount of time. Second, the 
candidate branches should account for at least a cer- 
tain percentage of the total branches executed during 
this time. We define the minimum percentage over the 
time interval to be the threshold execution percentage 
and the actual dynamic percentage over the interval to 
be the candidate execution percentage. 

In order to minimize disruption of the system dm-- 
ing hot spot detection, we perform the detection in 
hardware using a Hot Spot Detection Counter (IIDC), 
shown in Figure 3. The Hot Spot Detection Counter 
is an up/down counter used to detect when the set 
of candidate branches reaches the threshold execution 
percentage. The counter is implemented as a saturat- 
ing adder that is initialized to the maximum value. It 
counts down by D for each candidate brancli executed 
or counts up by / for each non-candidate branch ex- 
ecuted, where the determination of D and / will be 
discussed later. When the candidate execution per- 
centage exceeds the threshold percentage, the counter 
begins to move down. If the candidate execution per- 
centage remains higher them the threshold for a long 
enough period of time, the counter will decrement to 
zero. At this point, the operating system will be trig- 
gered either via an interrupt or by setting a flag that is 
checked the next time the operating system is invoked. 

The difference between the candidate execution per- 
centage and the threshold execution percentage deter- 
mines the rate at which the counter decrements (i.e., 
the rate at which the haurdware identifies the hot spot). 
This corresponds to our observation that hot spots be- 
come more desirable as they either account for a larger 
percentage of total execution or run for a longer period 
of time. It is assumed that hot spots which have been 
active over a longer period of time are less likely to 
be spurious in their execution emd are more likely to 
continue to run after optimization has been complete. 

Our experiments have shown this approach to work 
quite well and to detect all of the major hot spots in 
our benchmarks. There arc thi'ee primary scenarios 
where there is no hot spot to be found, and thus the 
HDC will never reach zero: 

1 . Few branches execute with sufficient frequency to 
be marked as candidates, and collectively, they 
do not constitute a large percentage of the total 
executk>n. Thus, even if they were classified as 
a hot spot and optimized, only a small benefit is 
likely. 

2. The number of branches that execute frequently 
enough to be considered candidates is too large to 
fit into the BBB. If the branches that are able to 
enter the BBB do not account for a large enough 
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percentage of execution, they will not be iden- 
tified as a hot spot. This may happen if the 
execution profile of the region is very flat. Al- 
though some benefit may be gained by optimiz- 
ing all the frequent branches, the overhead of op- 
timizing sucli a large region would most likely be 
prohibitive. 

3. The execution profile is not consistent. In this 
case, a smail set of branches maj' account for a 
large percentage of execution over a short time, 
but the execution shifts to a different region of 
code before the Hot Spot Detection Counter sat- 
urates. Optimizing a region of code that only 
executes spuriously is unlikely to yield much ben- 
efit- 
As in these three scenarios, branches that collec- 
tively do not constitute a hot spot may be locked 
into the BBB. Since the HDC has not identified these 
branches as a hot spot, it is possible that the execution 
has shifted to a new, potentially more important region 
of code or is not an attractive optimization opportu- 
nity. Therefore the BBB will be periodically purged by 
the reset timer to make room for new branches. This 
timer is similar to the refresh timer BBB but clears all 
entries in the table, rather than only the non-candidate 
branches- The reset interval should be large enough to 
allow the HDC to saturate on vaUd hot spots but small 
enough to allow quick identification of a new phase of 
execution. 

Once the threshold execution percentage Xt re- 
quired for a hot spot has been selected, the HDC incre- 
ment and decrement values should be chosen. D is the 
decrement value when a candidate branch is encoun- 
tered {candidate hit), and / is the increment value for 
a- branch that is not in the table or is not yet marked 
as a candidate {candidate miss). Let X be the actual 
candidate execution percentage. For a given D and 7, 
the counter will decrease when the candidate execu- 
tion percentage multiplied by the decrement value is 
greater than the percentage of non-candidates multi- 
plied by the increment value. This is represented by 
the equation: 

X*{-D) + {l-X)*{I)<0 ■ (2) 

Rearranging the terms and solving for X yields the for- 
mula for minimum percentage: 



Equation 3 shows that the counter decreases when the 
percentage of execution is above the threshold, as de- 
termined by /and D. 



Given the increment and decrement values, the size 
of the HDC can be chosen to achieve a minimum de- 
tection latency. Let N be the minimum number of 
branches executed before a hot spot is detected. For 
detection to occur, the following inequality must hold: 

r^*X*{-D)+N*{l-X)*{I) < -HDCMAX.VAL (4) 

The latency for detecting a hot spot is determined by 
the following equation: 

_ HDCMAX.VAL 
- {D + I)*iX-Xt) 

As the candidate execution percentage further sur- 
passes the threshold, the detection latency decreases. 
The latency can also be decreased independently of the 
candidate execution percentage by increasing / and D 
such that Xt remains constant. 

2:2 Monitor Table 

The purpose of the Monitor Table is to determine 
when hot spot profiUng is necessary.* This hardware 
mechanism is continuously running, watching program 
execution and comparing the executing branches to 
those already determined to be in hot spots. When 
the program is executing in the known hot spots, the 
system is said to be in monitor mode, which is the 
steady-state mode of execution. The system enters pro- 
file mode, and the BBB is enabled when the Monitor 
Table determines that execution has strayed from the 
known set of hot spots. Note that the Monitor Table 
contuiues to operate during profile mode, watching for 
execution to return to the set of known hot spots. If 
this situation were to occur, the BBB would be deacti- 
vated, since it is unnecessary and costly to extract and 
possibly optimize a hot spot that has akeady been pro- 
cessed. If a new hot spot is found, the operating system 
is notified which then extracts the hot spot from the 
BBB; the new hot spot is entered into the Monitor Ta- 
ble; and the BBB is deactivated. When new, optimized 
code is ready for deployment, the branches in the new 
code will be added to the table just prior to deployment 
to avoid hot spot detection in the optimized code. 

pi order for the system to detect when execution 
strays from the known set of hot spots, the hardware 
must be. aware of those hot spots which have already 
been identified. In an ideal Monitor Table, the ad- 
dresses of all branches in all known hot spots would be 
placed into a tag array as shown in Figure 5. When 
a branch is executed, ;the instruction address is looked 
up in the, tag array. Its presence in the array indicates 
execution in. a hot spot. An up/down counter called 
the Monitor Counter is used to track long-term execu- 
tion trends and operates much like the HDC. It counts 



iXX;iD: <XP_10334956A_I_> 




SnrUd SUU« BranettM 

Figure 2. Profile distribution for the second 
primary hot spot in 134.perl. Branches sorted 
from most to least frequent. 

hot spot to aid profile-driven optimizations. Once a hot 
spot has been detected, we add the hot spot to a mon- 
itor table and disable the profiling hardware. During 
the second stage, we monitor the program and compare 
the executing code against the table of currently active 
hot spots. When execution strays from the monitored 
hot spots, the profiler is enabled to allow detection of 
new hot spots. 

The rest of this paper is organized as follows: Sec- 
tion 2 explores the mechanisms for performing hot spot 
detection during runtime; Section 3 details our exper- 
imental environment and results; Section 4 examines 
research related to our endeavors; and Section 5 sum- 
marizes our efforts and future directions. 

2 System Components 

Our proposed hot spot detection scheme uses three 
criteria to classify a region of code as a hot spot. First, 
the region must have a small static code size to facili- 
tate rapid optimization. Second, the hot spot must be 
active over a ccrtaiin minimum time interval so that it 
is likely to liave an opportunity to benefit from runtime 
optimization. Finally, the instructions in the selected 
region of code must account for a large majority of 
the dynamic execution during its active time interval. 
These three criteria are sufficient to detect code regions 
that can benefit most from runtime optimization with- 
out placing unnecessary restrictions on the type of hot 
spots that can be identified. 

Our proposed scheme must both detect when the 
execution is in a hot spot and also gather profile data 
for that hot spot. When a valid hot spot has been 
discovered, the operating system is notified so that" it 
can invoke the runtime optimizer. Additional hard- 
ware is used to ensure that further notifications occur 



only when significant new hot spots are detected. Our 
proposed implementation consists of the following com- 
ponents: 

Hot Spot Detector Hardware that collects control 
flow profiles and identifies a collection of impor- 
tant blocks that comprise a hot spot. The Hot 
Spot Detector contains a Branch Behavior BuflTer 
(BBB) to store the branches and profile data and 
a Hot Spot Detection Counter (HDC) to track 
the percentage of dynamic execution accounted 
for by the hot branches. 

Monitor Table Hardware that maintains a collection 
of previously discovered hot spots that the run- 
time optimizer has already examined. It moni- 
tors program execution, noting when execution 
strays from previously detected hot spots so that 
profiling can be restarted. 

Operating System Support Software that reads 
the hot spot information from the BBB and adds 
the hot spot blocks to the Monitor Table. It can 
also assemble the blocks into a region in order to 
call the runtime optimizer. 

The Hot Spot Detector and Monitor Table can eas- 
ily tolerate a rather large latency when recording infor- 
mation about program execution. Therefore, our pro- 
posed hardware can be deeply pipehned and located off 
the critical path so that it does not affect processor per- 
formance. While invoking the operating system docs 
incur a penalty, our experimental data shows that the 
number of operating system interrupts is insignificant 
relative to the total execution time. This is because 
the operating system is typically invoked only when a 
new optunization opportunity is detected - 

The following subsections describe each of the three 
main components in the context of a shigle process 
system. This section concludes with a description of 
the extensions necessary for a multiprocess system and 
other enhancements. 

2,1 Hot Spot Detector 

The first step in the process of identifying hot spots 
is to detect the frequently executed blocks. By employ- 
ing a hardware scheme, operating system overhead can 
be eliminated from this portion of the process and de- 
tection can be completely transparent to the system. 
Such blocks can be easily collected in hardware by gath- 
ering the branches that define their boundaries. By ex- 
amining branch execution and direction information, a 
control flow graph with arc weights can be constructed. 
Through this hardware scheme, reasonably accurate 
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Figures. Hot Spot Detector hardware. 

profile information will be collected whicli can bene- 
fit the runtime optimization effort, 

2.1.1 Branch Behavior Buffer 

To achieve these collection goals in hardware, a cache 
structure reminiscent of a Branch Target Duffer (BTB) 
will be used. As depicted in Figure 3, the structure, 
called a Branch Behavior Buffer (BBB), will be indexed 
on branch address and will contain several fields: tag 
(or branch address), branch execution count, branch 
taken count, and branch candidate flag. When the pro- 
cessor retires a branch instruction, the instruction ad- 
dress will provide the index into the BBB. Each time 
a branch address is found in the BBB, its execution 
counter is incremented. If the branch is taken, the 
taken counter is also incremented. The combination of 
these two values constitutes an arc weight profile of the 
recently executed code. Note, it is possible for a branch 
to execute more times than can be represented by the 
execution counter. When the execution counter reaches 
its maximum value, the hardware stops incrementing 
both the execution and taken counters to preserve their 
ratio. As long as the number of branches that reach 
saturation is smeill, the profile will still accurately rei> 
resent the relative importance of the branches. 

The puri^ose of the BBB is to collect and profile 
only frequently executed branches whose correspond- 
ing blocks account for a vast majority of the dynam- 
ically executing instructions. Branches that execute 
frequently during profiling may be part of a potential 
hot spot, and we refer to such . branches as candidate 
branches. In order to make this determination, an ex- 
ecuted branch is temporarily allocated an entry in the 
BBB and monitored over a short interval: If the exe- 
cution of that branch is frequent enough during that 
interval, its execution counter will surpass a predefined 
candidate threshold. The candidate threshold is asso- 
ciated with a single bit in the execution counter, as 
shown in Figure 4. When this bit is triggered, the can- 



Figure 4. Fields in each Branch Behavior 
Buffer entry. 

didate flag is set, and the entry remains marked as a 
candidate wliile the execution counter continues to col- 
lect profile data. Thus, the DBB only needs to be large 
enough to hold, the candidate branches and potential 
candidate branches for a given interval. 

To ensure that only frequently executing brandies 
are marked as candidates, a refresh timer periodically 
clears entries from the BBB that have not surpassed 
the candidate threshold.- The refresh timer is simply 
a global counter that increments each time a branch 
uistruction is executed 'and triggers a refresh of the 
BBB when it reaches a certain value. Refreshing the 
BBB flushes the insignificant entries and ensures that 
each branch marked as a candidate accounts for at least 
a minimum percentage of the total dynamic branches 
during a fixed interval. The minimum percentage of 
execution required of candidate branches can be ex- 
pressed as a candidate ratio. Thus, 

CANDIDATE JtATIO = 2^2", (1) 

given that the size of the refresh tuner is n bits, and 
the candidate threshold is represented by bit bk of the 
execution counter. 

To accurately represent a hot spot, the BBB must 
be able to allocate entries for most of the branches that 
are hnportant in the hot spot. The BBB size should be 
equal to or larger than the total number of branches 
present within a hot spot. If the BBB is too small, then 
insignificant branches may prevent important branches 
ft-om entering during the first few refresh intervails. Sta- 
tistically, the important branches will eventually get 
entries in subsequent refresh intervals, but the profile 
accuracy may be somewhat compromised and the re- 
porting of hot spots may be delayed. 

Another problem is the possibility of indexing con- 
flicts that are inherent in any cache structure. As has 
been seen in the BTB or I-cache, making the struc- 
tures set associative elimmates many of the indexing 
conflicts. 

Although organized like a set associative cache, the 
BBB's behavior differs in terms of replacement policy. 
The BTB and I-cache often use a least recently used 
(LRU) policy in order to reduce cache misses. For ex- 
ample, suppose X and Y are insti-uctions that map to 
the same I-cache location, and X is currently in the 
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Figure 5. Monitor Table hardware. 

down when a hot spot branch is , executed and counts 
up when a nou-hot spot branch is executed. V/heu the 
Monitor Counter saturates at the maximum vaJue in 
monitor mode, a high percentage of recent branches 
outside of the Icnown hot spots have been executed, in- 
dicating a possible transition to a new hot spot. At this 
time, profile mode is resumed. Similarly, when the sys- 
tem is in profile mode and the Monitor Counter reaches 
zero, program execution must have returned to the set 
of known hot spots. At this time, the BBB is deac- 
tivated, and monitor mode is resumed. When mon- 
itor mode is entered from profile mode, the Monitor 
Counter is initialized to zero, indicating that execution 
is in a hot spot. This is a valid assumption because 
either the Monitor Counter just saturated at zero to 
end profile mode, or a hot spot was just detected and 
execution is likely to contmue in the hot spot. 

As in the HDC, the increment and decrement values 
for the Monitor Counter determine the threshold ratio 
of hot spot to non-hot spot branches. Although a min- 
imum ratio of hot spot to non-hot spot branches must 
be maintained to remain in monitor mode, this ratio 
should not be as high as in the HDC. A lower ratio 
is used for the monitor hardwai-e to allow the behav- 
ior of the hot spots to vary slightly without reentering 
profile mode. Once this ratio is determined, the same 
formula used for the HDC can be used to derive suit- 
able increment and decrement values for the Monitor 
Counter. 

2.3 Operating System Support 

The interface between the BBB and the operating 
system was designed to be simple: the BBB signals the 
OS only when a hot spot ha^ been detected, and the 
operating system copies the BBB table contents into 
system memory upon hot spot detection. This interface 
minimizes operating system involvement with hot spot 
detection. Specifically, cycles are not stolen away by 
the OS. from the user application unless a hot spot is 



actually detected. 

In order for the OS to read the BBB contents, the 
BBB must be either hardware addressable or read- 
able through special instructions. Once extracted, the 
branches are stored internally for access by the run- 
time optimizer. They are al.so processed and written 
to the Monitor Table, again via hardware addressing 
or via special instructions. Since the table is a cache, 
conflicts over entries in the Monitor Table can arise. In 
this situation, the OS must either remove an old hot 
spot that contains a conflicting branch, thus accurately 
preservuig a smaller set of hot spots, or simply elimi- 
nate one of the conflicting instructions from the table, 
thus slightly compromising the recognition of one of the 
known hot spots. Both of these situations are subopti- 
mal because code previously detected as being part of 
a hot spot may now trigger the profiling system. The 
OS is also responsible for deploying optimized code and 
entering it into the Monitor Table. Furthermore, the 
operating system has control over several system pa- 
rameters (BBB increment and decrement values, etc.) 
and can adjust them based on quaUty of the hot spots 
being collected. The actual operatmg system poUcies 
are beyond the scope of this paper. 

2.4 Multiprocess support 

Thus far, the hardware design has assumed single 
process execution. Operation becomes sUghtly more 
complicated when considering the context switciiing 
abilities of microprocessors. It is the responsibility of 
the operating system to correctly maintain the state of 
the proposed hardware in a multiprocess environment 
as it is required to do for traditional hardware compo- 
nents. 

Because of the expense of swapping out even a sub- 
set of the BBB during context switches, the hot spot 
detection hardware is designed to operate in single pro- 
cess mode, persisting across context switches. Because 
the BBB responds to hot spots quickly, it is only active 
during a small percentage of an application's execu- 
tion. Becau.se of this low utilization, a single BBB can 
be shared among multipie processes. The BBB could 
be configured to search for hot spots associated with 
a particular process ID, thread ID, or code segment. 
This ID would be stored in a control register. 

Unlike the BBB, the Monitor Table is always hi use 
by each process. Since swapping a table in and out at 
each context switch would be extremely costly, all pro- 
cesses could share a single table. In order to accomplish 
this, an up/down Monitor Counter is necessai-y for cadi 
process or active subset to effectively track hot spot be- 
havior. Furthermore, each entry in the Monitor Table 
must also be tagged with its process ID. This ID serves 
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as a tag for comparison purposes when determining a 
hit or miss and for determining which Monitor Counter 
to update. 

When a Monitor Counter saturates indicating that 
profiling is necessary, the BBB must first be allocated 
to that particular process. A simple check of the BBB 
process ID control register can be made to determine 
BBB ownership. If the requesting process owns the 
BBB, profiling can continue without delay. Otherwise, 
arbitration must occur between all of the processes re-' 
questing use of the BBB. This aibitration may be im- 
plemented in the hardware itself or within the OS. Al- 
though processes may be denied use of the BBB for 
a short time, the BBB may be acquked by the next 
waiting process as soon as the BBB resets or the HDC 
saturates. 

2.5 Enhancements to the base hardware 

Several enhancements might be made to the base 
hardware. The first enhancement would be to reduce 
the size of the BBB by considering only conditional 
and mdirect branches. The execution and taken profile 
weights of blocks with unconditional branches or direct 
calls can be determined by inference via Kirchhoff's 
First Law and need not be profiled. However, this ap- 
proach requires that the runtime optimizer spend more 
time analyzing and constructing the control flow graph. 

A second enhancement would be to index into the 
BBB with a combination of the branch's address and 
Its direction. Thus, the taken and not taken paths 
of a particular branch would be recorded in separated 
entries allowmg for the eUmination of the taken counter 
m each BBB entry, mile this approach would result in 
either an increase in the size of the BBB or a reduction 
in the number of distinct branch addresses in the BBB, 
it would allow the system to detect finer changes in 
program, and hot spot, behavior. For example, control 
flow changes within a particular set of blocks would 
now be detected and possibly reoptimized. 

A third enhancement would be to include support 
for profiling the arc weights of indirect branches. Cur- 
rently, profilmg determines only the branch weights of 
these branches. An additional table indexed on a com- 
bination of the branch and target addresses could be 
used to store the actual profile. The BBB entry for that 
branch would still gather the branch execution count 
and determine branch candidacy. 

A fourth enhauceurient would be to add a secondary, 
coarse-grained component, to the Monitor Table. Tliis 
secondary table would track ranges of addresses rather 
than single branches. Each range table entry would 
consist of a base address and size, and any addres.s 
falling inside the range would be considered part of a 



Benchmark 



Num. 
I lusts. 



099. go 

124.m88ksim 



126.gcc 



l29.compress 



lao.ii 



132.ijpeg 



134. perl 



14 7. vortex 



MSWord{A) 



.MS Word (B) 



MSExcel 



Adobe Photo- 
Deluxe(A) 



89.5M 



120M 



1.18B 



Actions TVaced 



2stone9.in training input 



clt.in training input 



2.88B 



151M 



1.56B 



2.34B 



2.20B 



amptjp.i training input 



test.in training input 
count enlarged to 800k 



train.Isp training input (6 queens) 



vjgo.ppm training input 



jumble.pl training input 



32oM 
912M 



vortex.in training input 



opened 16.0 MB .doc file, 
searched for number, then closed 



160M 



loaded 25 page .doc file, repaginated 
ran built-in word count, selected entire 
document, changed font to Arial, undo 
change midway through, closed file 



390M 



Adobe Photo- 
Deluxe(B) 



Ghostview 



108M 
3.17B 



generated silicon diffusion data 
from \a3 script &£ graphs from data 



loaded detailed tiff image, brightened, 
increased contrast, and saved as 
PhotoDeiuxe image 



exported detailed tiff image to 
encapsulated postscript 



loaded ghostview, loaded 9 page ps 
file, viewed and zoomed in on 4 pages, 
performed text extract 



Table 1. Benchmarks used for hot spot detec- 
tion experiments. 

hot spot. Since a hot spot would be returned from 
the optimizer as a single, contiguous region with a 
larger code size due to aggressive optimizations, the 
region could be added to the range table and thus help 
avoid conflicts among entries in the fine-grained table. 
Clearly, this enhancement would require a higher de- 
gree of operating system support to manage. 

3 Experimental Evaluation 

IVace-d riven simulations were performed for a num- 
ber of benchmarks in order to explore hot spot chai- 
acteristics and to establish the effectiveness of the pro- 
posed hot spot detection scheme. Both SPBCJNT95 
and common l^mdou^iA^r applications were simulated 
to provide a broad spectrum of typical programs. 
These benclimarks are summarized in Table 1. The 
eight applications from the SPECINT95 benchmark 
suite were compiled from source code using the Mi- 
crosoft VC'-h-h 6.0 compiler with the optimize for speed 
and inline where suitable settings. Several WindowsNT 
applicatioas executing a variety of tasks were also sim- 
ulated. These applications are the general distribution 
versions, and thus were compiled by their respective 
software vendors. In order to ensure examination of 
all executed user instructions, sampling was not used 
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Parameter 


- Setting 


Num BBB entries 


2048 


EBB associativity 


2- way 


Exec and taken cntr size 


9 bits 


Candidate branch thresh 


IG 


Refresh timer interval 


4096 branches 


Clear timer interval 


65535 branches 


Global cand cntr size 


13 bits 


Global cand cntr inc 


2 


Global cand cntr dec 


i 


Global mon cntr size 


12 bits ! 


Global mon cntr inc 


1 j 


Global rnon cntr dec 


1 ! 



Table 2. Hardware parameter settings. 

dui'ing trace acquisition or simulation. 

The experiments were performed using the inputs 
shown in Table 1 . The hot spot detection hardware was 
then simulated on an instruction-by-instruction basis 
for the entire execution. lu order to extract complete 
execution traces of these appHcations (all user code, 
including statically- and dynamically-linked libraries), 
we employed special hardware capable of capturing dy- 
namic instruction traces On an AMD K6 platform. This 
unit, known as SpeedTracer> was donated by AMD for 
our research. 

3.1 Hardware Parameters 

Because the design space is large, experimentally 
evaluating the individual effect of each hardware pa- 
rameter was infeasible. We, therefore, selected initial 
parameters that attempted to match the observed hot 
spot behavior and then further refined them, result- 
ing in j)arameters that exhibit desirable hot spot col- 
lection behavior. These parameters were used in the 
experiments presented in this section and are shown 
in Tabic 2. The BBB hardware was configured to al- 
low branches with a dynamic execution percentage of 
.4% (16 executions/4096 branches) or higher to become 
candidates (the candidate ratio). The H DC was config- 
ured to require that the candidate branches total more 
than 66% (2:1) of the execution to become a hot spot 
(the threshold execution percentage). The BBB was 
allowed 16 refreshes (totaling 6553t5 branches) to de- 
tect a hot spot before it was reset. To collect these 
results, we used a Monitor Table large enough to con- 
tain all hot spot branches in order to examine the hot 
spot behavior over the course of the whole execution. 

3.2 Results 

Table 3 summarizes the cflfectiveness of our proposed 
hardware at detecting runtime optimization opportu- 



nities for each benchmark. The number of hot spots 
column lists the number of times that the HDC sat- 
urated at zero, indicating the detection of a new hot 
spot. This number also represents the number of times 
that the operating system was interrupted by the BBB. 
The number of static instructions in hot spots is a close 
estimate of the total number of instructions that will 
be delivered to the optimizer over the entire execution. 
The next column, percent static executed instructions 
in hot spots, relates the number of static instructions 
in hot spots to the total number of static instructions 
executed. In other words, out of eJl the static instruc- 
tions executed by the microprocessor, only a small per- 
centage lie within hot spots. The portion of total dy- 
namic instructions represented by these hot spots is 
shown in the next column, percent total execution in 
hot spots. Because this hardware cannot detect liot 
spots instantly, some tkne that could be spent execut- 
ing in optimized hot spots is spent during detection. 
The time spent in hot spots after they eire detected 
is shown in the percent total execution in detected hot 
spots, and the time lost to detection can be found by 
taking the difference between this column and the pre- 
vious column. Finally, the last column, dynamic in- 
structions in hot spots after detection, shows the num- 
ber of dynamic instructions that could benefit from 
runtime optimization. This number reflects any subse- 
quent reuses of detected hot spots. For the purposes 
of our experimentation, when an instruction is part of 
several hot spots, we heuristically attribute the instruc- 
tion to the most recently accessed of the relevant hot 
spots. 

Analysis of the results shows that only a small per- 
centage, usually less than 3%, of the static code seen 
by the microprocessor executes intensively enough to 
become hot spots. Since a large percentage of the dy- 
namic execution is represented by a small set of instruc- 
tions, often nearly 90% of the program's execution, a 
runtime optimizer could easily focus on this small set 
with the potential for significant performance increase. 
In addition, only about 1% of the possible time spent 
in optimized hot spots is lost due to detection. For 
example, in ISO.li, the number of hot spot static in- 
structions comprise only 3.00% of the total static in- 
structions, yielding a total hot spot code size of 1447 
instructions. Furthermore, 90.83% of the entire execu- 
tion is spent in detected hot spots. Our analysis shows 
that ideally the hot-spots account for 91.2S%, and, 
thus, only .40% is lost during the detection process. 
This indicates that our Hot Spot Detector makes the 
identification so swiftly that the execution of hot spot 
regions falls almost entirely within potentially runtime 
optimized code. 
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Benchmark 


# hot 


# static 


% static 


% total 


% total 


Ovti inctc 




spots 


insts. in 


executed 


exec, in 


exec, in 


in hot 






hot spots 


insts. in 
hot spots 


hot spots 


detected 
hot spots 


spots after 
detection 


O99.go 


6 


2398 


3.46 


37,84 


35.39 


'll 7M 

0 1 . ( iVi 


124.m88ksim 


4 


1576 


2.78 


93.03 


92.30 


llOM 


I26.gcc 


47 


17665 


8.90 


58.42 


52.12 


617M 


129. compress 


7 


918 


2.12 


99.93 


99.81 


2.87B 


130.1i 


8 


1447 


3.00 


91.28 


90.88 


137M 


132,ijpeg 


8 


2556 


3.48 


91.07 


91.00 


1 42B 


134. perl 


5 


1738 


2.13 


88.43 


85.99 


2.01B 


147. vortex 


5 


2161 


1.76 


72.30 


71.93 


1.58B 


MSWord(A) 


5 


3151 


1.17 


91.36 


91.08 • 


296M 


MSWord{B) 


21 


12541 


2.40 


69.13 


62.04 


566M 


MSExcel 


25 


18936 


2.94 


60.01 


54.85 . 


88.2M 


PhotoD.(A) 


20 


5485 


1.68 


94.31 


90.97 


. 354M 


PhotoD.(B) 


14 


4192 


1.78 


94,24 


90.81 


98.5M 


Ghostview 


33 


8938 


2.82 


73.39 


72.55 


2.30B 



Table 3. Summary of the hot spots found in the benchmarks. 



code size of the hot spot, while the line graph shows the 
percentage of the execution spent in that hot spot after 
detection. This benchmark consists of three primary 
hot spots: hot spots 1, 4, and 5 on the graph. These 
correspond to the three hot spots in Figure 1 in the 
introduction (note that in the histogram, 134.perl was 
compiled for the IMPACT architecture without inhn- 
ing). Ytom the histogram, code can be seen executing 
between the first and second primary hot spots. Hzind 
analysis did not classify code in that region to be hot 
spots, but the heirdware did as fictual hot spots 2 and 
3. Upon further hand examination of actual hot spot 
3, an intensely executed region of code is found, thus 
verifying the hardware's determination. In hot spot 3, 
the cmd.exec function loops 117k times calling str.free 
in each iteration. The 9 blocks, totaling 43 static in- 
structions, contribute 7.3M dynamic mstructions to the 
program's total execution. Because of the intense na- 
ture of these blocks, they serve as good candidates for 
runtime optimization. Analysis of this benchmsirk also 
shows that, one Iiot spot is much more dominant than 
the others in terms of dynamic execution. In this case, 
optimizing only hot spot 4 could benefit over 58% of 
the dynamic instructions executed. 

Similar characteristics were observed in the other 
benchmarks. Figure 6(b) shows an example from one 
of the pre-compiled WindowsNT applications. For this 
benchnfiark, there are a few hot spots that each repre- 
sent 8% or more of the total execution which together 
represent more than 50%. We also see quite a few hot 
spots with small static code sizes, indicating tight, in- 
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Figure 6. Detailed hot spot statistics. 

Examining individual hot spots reveals interesting 
characteristics of program behavior. Figure 6(a) de- 
tails the detected hot spots from the 134. per I bench- 
mark. For each hot spot, the bar graph shows the static 



tensely executed code. In fact, for this benchmark, the 
smaller-sized hot spots are also those with high total 
execution percentages, indicating excellent opportuni- 
ties for runtime optimization. 

The benclunark 099. go is a notable example of a 
benchmark without obvious hot spots. While this game 
simulation repetitively executes players' moves, each 
move touches a large amount of static code with lit- 
tle temporal repetition. The hardware was still able 
to detect six hot spots representing 35% of the execu- 
tion. There is one primary hot spot that represents 
28% of the execution with a static code size of 1170 
instructions. 

Our data has shown that the static sizes of the de- 
tected hot spots vary significantly, fj-om tens of instruc- 
tions to the low thousands. At this point in the devel- 
opment of runtime optimization technology,, the max- 
imum static code size that a runtime optimizer could 
handle is an open question. The goal of our work is to 
quickly and accurately identify the hot spots, without 
placing too many artificial restrictions on the charac- 
teristics of the hot spots detected. We believe that it is 
better to submit a larger region of code to the runtime 
optimizer and rely on the optimizer to pare down the 
code size using all available information that has been 
gathered about the hot spot. However, by adjusting 
the existing hardwzire parameters it is easy to limit the 
size of the hot spots detected. 

The dynamic execution lengths of the individual hot 
spots also vary significantly. In some cases, the ex- 
ecution lengths of the hot spots are fairly small but 
still account for hundreds of thousands of instructions. 
These lengths may be too short to benefit from runtime 
optimization, especially when the performance gains 
earned from the optimized code must offset the cost of 
detection and reoptimization. Unfortunately, the total 
execution time is not known at runtime. The operating 
system will have to make further decisions about which 
hot spots to actually optimize. 

3.3 Implementation Cost 

For our experiments wc used a BBB large enough to 
guarantee accurate results .for all of our benclimarks. 
With 2048 entries available, very few cache conflicts 
occurred, and 9-bit execution and taken counters were 
sufRcient to capture profile data for most hot spots. 
For an equivalent hardware implementation, each BBB 
entry would contain a 22-bit tag field, two 9-bit coun- 
ters, and a one-bit candidate fiag. Thus, the total BBB 
hardware cost is slightly over 10 kilobytes. Compared 
to AMD's K7, which is reported to have a 128KB on- 
chip Ll cache, this size seems justifiable. Although a 
variety of hardware schemes may be used to implement 



the Monitor Table, it is likely to require less hardware 
than the BBB. A Monitor Table (without a coarse- 
grained component) that accommodates 3000 branch 
entries would achieve ideal results for our benchmarks. 
The Monitor Table, however, does not require pro- 
file counters, and, furthermore, a clever implementa- 
tion may use a single entry to represent several nearby 
branches. 

Next, we considered the cycles- lost to reporting the 
hot spots to the OS. These cycles offset some of the 
. gains made by executing in optimized code. The op- 
erating system is only interrupted when a hot spot is 
reported by the BBB. A rough estimate can be made 
about the time required to trfuisfer the contents of the 
BBB to operating system memory. If we conservatively 
suppose that each of the BBB's 2048 41-bit entries re- 
quii*es three cycles to transfer, then the net cost per 
hot spot is 6141 cycles. Even for 126.gcc with 47 hot 
spots, the net cost is 2S8K cycles, which is negUgible 
when compared to the number of instructions (617M) 
spent executing within the hot spots. 

4 Related work 

Several strategies have been proposed to collect pro- 
file data, and ideas from those implementations have 
motivated our efforts to create the Branch Behavior 
Buffer. Other recent efforts, however, have focused on 
estimating average progrcun behavior through statisti- 
cal sampling. Consequently these techniques rely on 
heuristics used during static analysis to discover im- 
portant regions of code. To our knowledge only our 
method addresses the unique requirements of runtime 
optimization. 

Conte, Menezes, and Hirsch [13] also propose the 
use of dedicated hardware for profiling. Their profile 
buffer is in the retirement stage and interacts with the 
reorder buffer to determine whether a branch ha.s been 
taken. The operating system periodically samples the 
profile buffer, gradually constructing an arc- based pro- 
file for use in static compiler optimizations. Various 
techniques are explored to reduce contention among 
branches within the profile buflfer, and similar schemes 
could be used to lower the hardware cost of our Branch 
Behavior Buffer. 

The Profile Me approacli taken by Dean et al (14) 
collects detailed information about a single sampled 
instruction and the interaction between a pair of sam- 
pled concurrent instructions. They capture interesting 
events such as cache misses and branch mispredictions 
so that an in-depth performance analysis of the mi- 
croarcliitecture can be performed offline. 

The Morph system, developed by Zhang et al [3] uses 
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operating system and compiler technology to provide a 
framework for continuous profile-driven optimization. 
Althougli our proposed hardware greatly reduces the 
dememd for software support, a low-overhead OS layer 
is still necessary. Because Morph demonstrates the fea- 
sibility of such an OS infrastructure, it is orthogonal to 
our hot spot detection mechanism. 

Although not the focus of this paper, we envision 
utilizing our binary reoptimization technology and im- 
plementing region-based runtime optimizations [15] us- 
ing our hot spot profile data. Our isolation of program 
hot spots also lends itself to the hot-cold optimization 
strategy proposed by Cohn and Lowney [16]. By ex- 
cluding the infrequently used, or cold, instructions and 
concentrating on the intensely used, or hot, instruc- 
tions in a hot spot, we can potentially derive significant 
performance benefit. 

5 Conclusion 

Traditional profiling focuses on collecting data to 
be used for compiler optimizations in a compile-run- 
recompile methodology. This approach restricts ag- 
gressive optimizations to a specific workload that is 
assumed to be representative. We have presented a 
method that dynamically identifies hot spots within a 
program and collects sufficient data to provide a frame- 
work for timely runtime optimization during the same 
execution. Oui- scheme requires minimal operating sys- 
tem support, incurs negligible runtime overhead, and 
requires a few hardware tables and counters located off 
of the critical execution path to identify and monitor 
the hot spots. 

Detailed trace-driven simulations were performed 
for a number of integer progreims, including bench- 
marks from SPEC95 and several WindowsNT appli- 
cations. The resulting data presented in this paper 
showed tliat the hot spot detection algorithm, when ap- 
plied to many common benchmarks and applications, 
ach ievcs a high rate of success and provides a promising 
framework for building a runtime optimizer for generzd 
purpose programs. Subsequent efforts will be directed 
toward implementing several region-based runtime op- 
timizations using the hot spot data and further explor- 
ing methods of delineating hot spots in the code. 
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Abstract 

Profile-based optimizations can be used for instruc- 
tion scheduling, loop scheduling, data preloading, 
function in-lining, and instruction cache performance 
enhancement. However, these techniques have not 
been embraced by software vendors because programs 
instrumented for profiling run 2-30 times slower, an 
awkward compile-run-rccompiU sequence is required, 
and a test input suite must be collected and validated 
for each program. This paper proposes using existing 
branch handling hardware to generate profile infor- 
mation in real time. Techniques are presented for 
both one-level and two- level branch hardware organi- 
zations. The approach produces high accuracy with 
small slowdown in execution {0A%~4.6%). This al- 
lows a program to be profiled while it is used, elim- 
inating the need for a test input suite. This prac- 
tically removes the inconvenience of profiling. With 
contemporary processors driven increasingly by com- 
piler support, hardware-based profiling is important 
-for high-performance systems. 

1 Introduction 

Advanced compilers perform optimizations across 
block boundaries to increase instruction-level paral- 
lelism, enhance resource usage and improve cache 
performance. Many of these methods, such as trace 
scheduling [1], and superblock scheduling [2]. either' 
rely on or can benefit from information about dy- 
namic program behavior. For example, traditional 
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optimizations enhance performance by an additional 
15% when combmed with profile-driven superblock 
formation [2]. Other examples include data preload- 
mg [3], improved function in-lining [4], and improved 
instruction cache performance [5]. 

There are several drawbacks to profile-driven op- 
timizations. Many of the techniques can result in 
code size explosion if they are performed too aggres- 
sively. Dynamic basic block execution frequencies can 
be used to reduce this phenomenon. More problem- 
atic is the task of profiling itself. Obtaining pro- 
file data through software methods can be complex 
and time consuming, requiring additional steps in the 
compilation process. The usual method employed is a 
compiU-run-recompiU sequence. First, the program 
is compiled with profiling probes placed within each 
basic block ^ The program is then run using several 
different test inputs. The resulting profile data is used 
to drive a profile-based compilation of the original 
program. 

Execution of the profiled version of the program is 
slow. Wiih some methods, the profiled version runs 
30 times slower than the optimized program. At best, 
a profiling program can be expected to run two times 
slower. In addition, test inputs need to be careful Iv 
chosen {6), [7]. 

Static estimation solves some of the problems re- 
lated to gathering profile data [6], However, these 
techniques are not as accurate as profiling [6], [7). 
When used for superblock scheduling, static estimates 
achieve approximately 50% of the speedup that pro- 
filing can achieve [S]. 

Many commercial microprocessors, such as the 
Pentium series [10] and the PowerPC 604 [11]. in- 
corporate some form of branch handling hardware. 
This paper proposes using existing branch handling 
hardware, along with OS support, to obtain profile 
information; Using this, the slowdown for profiling is 

' The profiling probe* are extra instruct-ions which log th« 
execution of a basic block at run time. 
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imperceptible (e.g.. U.-lCi -4i)*/Y im rcase). This aUows 
an application to be deployetl in the field an<l later re- 
trieved for profile-based reconipilation. Since it cap- 
tures actual usage, it solves the problem of obtaining 
valid test inputs for profiling. It also allows profiling 
of real-time applications and system software. Using 
dynamic information improves the accuracy of static 
techniques. In general, the techniques presented in 
this paper solve many of the problems with profiling 
and expand the usefulness of profile-driven optimiza- 
tion. 

The following section reviews several hardware 
branch prediction mechanisms, along with published 
mechanisms that out-perform those currently imple- 
mented. Methods for deriving profile information 
from hardware are discussed in the third section. Al- 
though these methods are less accurate than full- 
fledged profiling, they are significantly more accurate 
than static estimates. Metrics to measure this er- 
ror are discussed in Section 3.4. The fourth section 
presents experimental results and discusses the trade- 
offs between the various schemes. The paper closes 
with recommendations for hardware-supported pro- 
filing, many of which can be implemented today in 
existing systems. 

2 Branch Prediction and Profiling 

O 

There are several contemporary dynamic branch pre- 
diction mechanisms that have been implemented in 
commercial processors. This section briefly reviews 
these schemes. A graph representation for profile in- 
formation is also presented, along with two methods 
for grouping basic blocks into larger structures. 



2.1 



Contemporary branch handling mecha- 
nisms 



There are two classes of branch prediction methods: 
one-level and two-level schemes. One-level schemes 
use the address of the branch instruction to index 
into a branch target buffer (BTB), which contains a 
small state machine for predicting the outcome of a 
branch. When the branch completes execution the 
ictual outcome is used. to update the state machine. 
Pigure I depicts this process. The most common 
itate machine for one-level schemes is the two-bit 
:ounter predictor, described in [12J. This predictor 
s implemented in several contemporary processors. 
The nominal size for the one-level branch prediction 
:)uff'er is between 512 and 1024 entries. Our experi- 
nents show that the two-bit counter, when used with 
I 1024-entry BTB, achieves a branch prediction ac- 
:uracy of 90% on-average across the SPEC92 bench- 



Brimch Tari;ft Bufftr 



in.Mruciion addfc.s> 




upd^tte ^j^^ behavior 

Figure i; One-level branch prediction. 



marks. 

Two-level schemes use two separate buffers. The 
first buffer is indexed similar to the BTB and stores 
the branch history as a binary string. The second 
is indexed using this branch history and stores the 
state of a predictor. This is depicted in Figure 2. 
These schemes have been studied extensively by Yeh 



instntcinin xltlm.^ ■ 



HiiU,ry Rtgwer Table Paitem Table 

indtt 



hm.Tfy. IH(l|...n 




predict 



I fPriffff f'U'Cy 1 ttniete state behavior 

Figure 2: Two-level branch prediction. 



and Patt [13],[14j and we will use their nomenclature 
here. The first level buffer is termed the history regis- 
ter table (HRT). The HRT is b bits wide and stores a 
sequential, binary string of the branch's history, using 
0 for not-taken and I for taken branches^. A predic- 
tion is made by indexing into the HRT, then using the 
history string to index into a second table, the paitem 
table (PT). The PT stores the state of a small state 
machine used to predict the branch. This decouples 
the branch prediction from the address of the branch 
instruction. The effect of this decoupling is dramatic. 
Yeh's algorithm can achieve 96% branch prediction 
accuracy for SPEG92 benchmarks [13l,[14]. As of to- 
day. Yeh's algorithm has not been implemented in 
any commercially available microprocessor. However, 
the needs of wide-issue superscalars will likely drive 
future implementations of this branch predictor. 

2:2 Weighted control flow graphs 

Profile-driven optimizations use a structure known as 
a weighted control flow graph (WCFG).- which is a 

^The "PAs" scheme with a 1024-entry HRT (6 = 12) is used 
in this paper. 
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directed graph with basic-blocks as nodes. Arcs in a 
VVCFG are due to one of two occurrences: either a 
code label or a branch instruction. An unweighted 
CFG for each function can be determined statically 
by the compiler. 

A WCFG can be used to form larger groupings of 
blocks, which in turn can be used to enhance the 
scope of optimization and scheduling. Examples of 
these structures include Fisher's traces [1], and the 
IMPACT project superblocks [2]. Chang, et al re- 
port a speedup of 15% when superb locks were used 
to extend the scope of traditional optimizations [2] 
Superblock formation and trace selection both use 
the same heuristics to form traces. Superblocks differ 
from traces in the method for providing fix-up code 
for off- trace /superblock execution and tail duplica- 
tion [2], [9], [16]. Either method results in significant 
code size explosion. To limit this explosion, a thresh- 
old is placed on the execution frequency of a block. If 
a block's frequency is below this threshold, it is not 
considered for trace membership. (This is discussed 
in more detail in Section 3.4 below). 

There are several methods of recording profile in- 
formation. One method is to insert extra code at 
the beginning of each basic block that records the 
block id in a buffer. This buffer is then parsed into 
a WCFG, either periodically during execution, or af- 
ter program completes execution. One example is 
the Spike profiler, which is built into the back end of 
GNU CC [16). A disadvantage is its slowdown, which 
is approximately 30 limes for Sptkc. 

Another method used by ATi:T Global Informa- 
tion Solutions in their commercial compilers is arc- 
based profiling. In this method, a transition block is 
added to the code to record the execution along an 
arc [17], The target of the branch is changed to this 
new transition block, and an unconditional branch to 
the original destination is added to the end of the 
transition block. A table of all possible arcs is added 
to the object code by the compiler. An instruction 
to increment an arc's table entry is placed inside the 
transition block. When implemented, arc-based pro- 
filing results in a slowdown by a factor of two. Of the 
profiling approaches, arc-based profiling is the best 
suited to hardware adaptation. 

2.3 The drawbacks of software profiling 

Although the benefits of profile-driven optimization 
are large, there are many drawbacks to collecting 
profiles in software. The most severe is execution 
slowdown over unprofiled code. Slowdown is more 
than a minor inconvenience. Experience at ATi:T 
Global Information Solution? ha.s shown that slow- 



down is the major reason why profile-driven optimiza- 
tions have not been adopted by the user community. 
Real-time applications such as kernels and embed- 
ded systems are excluded from the benefits of profile- 
driven optimizations. Long-running applications such 
as database systems are often excluded from profiling 
as well. 

Another problem of profiling is the selection of in- 
puts for the profiling task. Programs that are highly 
data-dependent, such as a sort routine (simple) or a 
database application (complex), have branches that 
are sensitive to user inputs. If the inputs are not 
selected carefully, the profile will not reflect actual 
usage. Validating profiling is difficult without a large 
scale study of user habits. In the absence of this, 
profiling is typically done using a large set of inputs, 
further increasing the time required for accurate pro- 
filing. 

A third problem with profiling is the method for 
its use.. A program must be compiled with profil- 
ing enabled, run using the test inputs, and then re- 
compiled. For small programs, this is not difficult. 
For large systems, such as OS kernels or commer- 
cial database applications, this requires significant al- 
teration of Makefile scripts [18]. A large amount of 
man-hours is invested in these scripts. For this rea- 
son, software vendors are hesitant to adopt any profile 
driven optimizations. 

Ball and Larus have developed a set of heuristics to 
determine which arcs are more likely to be traversed 
in a CFG [8]. An extension to these that estimates 
node weights is presented in [19]. Such static heuris- 
tics can be used to solve many of the problems of 
profiling, but with less accurate results. For exam- 
ple, when static estimates are used to predict branch 
directions, the inaccuracies of the predictions are ap- 
proximately twice that of profiled information [19]. 

3 • Using Branch Prediction Hardware for 
Profiling 

The goal of this paper is to demonstrate that the con- 
tents of hardware branch buffers can be used to add 
weights to a statically-built CFG. Most commercial 
processors allow the serial scan-out of state informa- 
tion for testing purposes. In addition to this, several 
processors implement kernel-mode instructions for 
reading branch hardware buffers directly. Hardware 
implementations typically include target address in- 
formation along with prediction information. The 
combination of the target address (the destination of 
the arc), the buffer lag (the source of the arc), and 
the prediction information (the arc s weight), fully 
specify an arc in the WCFG. 
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The specific procedure for producing a W'CFG is 
as follows: (I) A program is compiled with a special 
identifier token (magic number), mdicating it con- 
tains a table of CFG arcs. (2) During execution, the 
kernel periodically reads the buffer and uses its con- 
tents to increment the arc counters. This period may 
be at every context switch, or more frequently. (3) 
On exit, the arc table is updated on disk. When 
branch hardware for profiling was implemented us- 
ing a Pentium-based ATLT server system, results 
show an imperceptible difference in execution time 
between programs modified in this way and unmod- 
ified, traditionally-optimized programs. This slow- 
down is shown in Table I for five of the SPECint92 
benchmarks. The maximum is for espresso, with a 
slowdown of just under o%. 

Table 1; Slowdown due to hardware profiling. 







Hardware 




Benchmark 


Unprofiled 


profiled 


Slow- 


time (sec) 


time (sec) 


down 


compress 


95.6 


98.4 


0.8% 


eqntott 


31 7 


31.9 


0.6% 


espresso 


45.4 


47.6 


4.6% 


gcc 


110.2 1 


114.0 


3.3% 


xlisp 


91.4 1 


91.8 1 


0,4% 



3.1 Code adjustments to support arc-based 
profiling 

There are two adjustments that need to be made 
to convert hardware branch information into arc 
weights. Indirect jumps can produce blocks with 
more than two outgoing arcs, reducing the one-to- 
one mapping between a buffer entry and an arc. The 
two primary sources of indirect jumps in C are due 
to call-through-pointers and switch statements. Call- 
through-pointers are not problematic, since trace se- 
lection is traditionally performed on a per-function 
basis. Switch statements can be converted into a 
chain of if statements. The performance lost from 
this conversion is later regained when the cases of the 
switch are sorted according to execution weight. The 
side-effect of this conversion is an increased branch 
target address predictability. 

The second adjustment concerns code labels. Basic 
blocks formed due to code- labels are never allocated 
entries in the hardware buffer. A solution to this 
problem is to use the structure of the static graph to 
propagate profile information to these blocks. This 
can be done when the program is recompiled, and 



does not need ro he donr ai execution time. 
3.2 Two-level profiling 

Slight modifications, are required to adapt two-level 
schemes for the recording of arc weights. Since these 
buffers store a history of branch behavior, counting 
the number of I s in a history register and dividing 
by the register width can be used to estimate the 
weight of an arc (we will refer to this as dumpzng the 
history register). After a history register is dumped 
It must be updated in some fashion so that its con- 
tents are not over-counted at the next dumping point 
Since dumping the buffer may occur at context switch 
points, there is no point in preserving the contents of 
the history registers. For these reasons, the history 
registers are initialized to 0 after being dumped^. 

Zeroing the history registers does not solve the 
problem of over-counting an arc's weight. Entries 
of *0' in the registers signify not-taken (fall-through) 
branches- A mechanism is needed that determines 
which -O's are due to actual execution and which are 
left over from the last buffer dump. Several tech- 
niques were experimented with, including marking 
each history with a dirty bit. In the dirty bit scheme, 
extra bits are added, one per history entry. These 
bits are cleared at each dump point. When a branch 
indexes mto a history register, the bit is set. Since 
the HRT is a cache-like structure, it will contain a 
tag store. The "dirty" state can be marked by stor- 
ing an invalid, impossible tag value for the history 
register entry. Thus, a dirty bit can be implemented 
with little hardware modification. 

For frequently-encountered branches, the dirty bit 
scheme will produce accurate, results. However, if 
a branch is encountered infrequently, the *0* entries 
from the buffer dump may still remain in a register 
even though the dirty bit is set. This can increase the 
error for moderate-to-lightly visited basic blocks. The 
results in the following section support this claim. 

Another scheme is to use a marker bit to record 
the boundary between the valid and invalid branch 
histories, as illustrated in Figure 3. After each dump, 
the registers are zeroed and the LSB of each register 
IS set to T (i.e.. 00 - -01'). As the branch updates 
Its history, this bit shifts to the left. An extra full 
tit is maintained in the MSB of the register. When 
full bit = 1, the entire contents of the register are 
treated as valid at a dump point. Otherwise, only 
the positions to the right of the leading T in the 
register are valid (i.e., if '00 • ■ - Olirxr'. then only 

^ Our experiments show thac this causes negUgible change in 
the prediction accuracy. This is because context switches wiU 
normally result in a partial or near-total flush of the buffer 
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the xzxz bits are valid). To complete this scheme, 
the /u// 61/ is zeroed at each dump point. Note that 
the marker bit is not wasted space. The entire history 
register holds a valid history once the full bit is set. 
Therefore, only one additional bit per history entry is 
required, regardless of the length of the HRT entries. 



History Register Table 

Initially (after dumping) 

full buip fO 0 0 0 0 0 0^ ;i 

invalid histor>- marker bil 

After hislorj-: TNTT : 

full bitl blO:D 0 ^1 1 0 I 1 ~ 

marker bii 

After history: 'mrmNNT ' 

i 

full bil l 1 0 1 1 0 0 0 11 i 



Figure 3: The marker bit modification to Yeh's aigo^ 
rithm for arc weight calculation. . 



3.3 One-level profiling 

Modification of one-level schemes is less complicated, 
but" also less accurate. Some indicator of the valid- 
ity of a branch history is still required. For one-level 
schemes, this can be implemented using a dii^y bit. 
As with the two-level schemes, this can be imple- 
mented without any modification if the BT3 main- 
tains a tag store. 

The inaccuracy for one-level schemes is a result of 
using a two-bit counter to estimate the number of 
times a branch is taken or not-taken. As with the two- 
level scheme, several approaches were tried until one 
was found that achieved highly accurate performance. 
This scheme relies on the ability to keep a count of the 
total number of instructions executed since the last 
history dump. This is relatively easy since most mod- 
ern processors have cn-chip performance monitoring 
hardware to record such information^. Given that N 
instructions were executed, the compiler (and hence 

* In the absence of moniloring hardware, an approximation 
can be obtained using the bufTcr .sampling rate and the proces- 
sor's CP! rating. 



the dumping routine) can approximate the average 
number of instructions per basic block, B. Then, if 
d entries in the BTB are dirty, each entry is assumed 
to be touched w = A7(B x d) times. This value is 
then used to translate the two-bit counter value into 
increments to the arc counters. This translation is 
presented in Table 2. The reason for the success of 
these approximations is discussed in Section 4 below. 



Table 2; Approximations used to convert BTB entries 
into arc weights. 



Counter 


Value Arc lo 


Increment 


value 


interpretation increment 


vaJue 


00 


strongly not-taken fall-through 


2u' 


01 


weakly not- taken faJl-through 


w 


10 


weakly taken target 


w 


11 


strongly taken target 


2w 



(where w = N/[B x d)) 



3.4 Comparing profiles 

Validation of hardware profiling is done by compar- 
ing traditionally-generated profiles {actual profiles) to 
hard ware- generated profiles {estimated profiles). One 
method for this is to perform trace selection on both 
the actual and the estimated profiles and compare the 
results. An example of trace selection is illustrated in 
Figure 4. Graph (a) is annotated with the actual pro- 
file information, whereas graph (b) is the hardware- 
generated profile. Traces are formed using an arc 
trace selection threshold of 60% to group blocks [15]. 
Code explosion is avoided by not extending traces to 
blocks with low weights. This is implemented as a 
threshold, T. Values of T = 01%, 1%, 3% and 5% 
are considered below. 

The metric for trace selection error is introduced 
using the example of Figure 4. In the actual graph 
(graph (a)), basic blocks 1. 7. 11 and 13 are grouped 
together to form a trace. Due to errors in the weights 
of outgoing arcs for block 7, the blocks 1, 7, 8 and 9 
are grouped to form a trace in the estimated graph. 
The error for block ' 7 is due to the difference in 
arc weights between the two graphs. The transition 
from block 7 to 8 will occur 0.15 x 0.1 = 1.5% of 
the total execution time. Similarly, the transition 
from 7 to 11 will occur 0.15 x 0.9 = 13.5% of the 
time. (Since the actual graph contains the real exe- 
cution frequencies of the program, these frequencies 
are used.) Hence, the transition from 7 to 11 occu- 
pies a higher percentage of the total execution. The 
trace in graph (b) incorrectly assumes the transition 
of 7 to 8 is more likely. This assumption is wrong 
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Figure 4: Trace selection exampJe. 



for 13-5%- 1.5% = 12% of the execution. The figure 
of 12% IS therefore the percentage of execution time 
that the incorrect trace membership will be exercised. 

In general, the trace selection error is the total per- 
centage of execution time that incorrect trace mem- 
bership IS exercised due to errors in the estimated 
profile. The trace selection error {TSE is formally 
specified by taking H', to be the normalized weight 
(or. execution frequency) of block t. Let Wij be the 
normalized weight of the arc from block i to block 
;. The trace selection error is calculated using the 
following procedure; 

1. TSE= 0 

2. forall blocks i 6 WCFG{aci ua/) do: 

(a) j — 'T-acc_5ucce5sor( WCFG( actual), I ) 

(b) ''*a«_sacccs5or(VVCFG(esiimate),i) 

(c) if j ^ k then TSE — TSE + K', x 

3. enddo 



Another method for comparison is the distribution 
of arc weight error versus block weights. This metric 
is useful smce it shows where the trace selection error 
.s occurnng. The distribution is calculated bv com- 
puting the maximum differences between the'actual 
and the estimated arc weights for each categorv of 
block frequencles^ Let u,,, be the weight from', to 
J in the estimated (hardware-generated) profile and 
Wij be the weight for the actual profile. Define the 
maximum difference to be, 

Then the {un normalized) distribution function is. 

— ^ ^Wi, . (2) 

or the sum of the maximum arc differences for each 
block with weight W. The distribution of arc weight 
error provides good insight into the performance of 
the techniques, as is shown in the next section. 

4 Experimental results 

The three schemes for hardware-based profiling were 
tested using benchmarks from the original SPEC92 
benchmark suite as test workloads. Results are pre- 
sented here for all the integer and an equal num- 
ber of floating-point. benchmarks (see Table 3). The 

Table 3: SPEC92 benchmarks used for evaluation. 



1 Class II Benchmark 


input 


Integer 


espresso 
xlisp 
eqn tot t 
compress 

sc 


cps.in 
li-input.Up 
int_pri_3 .eqn 
in 

loadal 
tree.i 


Floating- 
point 


doduc 

nasa7 

mdljdp2 

waves 

tomcatv 

ora 


doducin 
mdlj2.dat 

params 



benchmarks are compiled using the GNU C com- 
piler with all optimizations enabled. The FORTRAN 
floating-point benchmarks are first converted from 
FORTRAN to C. 

Several approximations are made in the previous 
section to extract ar c weights from hardware. One 

*The maximum difference is. used in order to avoid over- 
counting a single error. For example, there is a 4% difference 
for two arcs with weights 40%/eo% (actual) vs. 44%/56% (es- 
timate), not an S% difference. 
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Table 4: Dynamic basic block distribution. 



Bench- 


1 Basic Blocks 












ToiaJ 


mark 




t/-5u 


E-90 




E-lOO 


static 




15 


49 


225 


842 


2838 


7582 


xlisp 


10 


34 


119 


264 


1058 


3138 


eqntott 


1 


2 


6 


34 


502 


1323 


compress 


2 


5 


17 


21 


135 


432 


sc 


2 


7 


52 


135 


1529 


4G34 


gcc 


72 


348 


2610 


6535 


14382 


34347 


doduc 


1 


7 


283 


468 


1596 


3643 


nasaT 


2 


2 


2 


2 


210 


1716 


mdljdp2 


2 


5 


15 


35 


821 


848 


waveo 


2 


14 


72 


177 


1222 


3896 


tomcatv 


3 


6 


12 


14 


372 


1318 


ora 


3 


6 


13 


24 


396 


1791 



reason that these approximations are successful is the 
relatively high locality of branch instructions. This 
fact is illustrated in Table 4. These figures repre- 
sent the distribution of unique basic blocks during 
execution. The "E-x'^ column presents the number 
of blocks that occupy r percent of the benchmark s 
execution. For example, of the 1323 branches in eqn- 
tott, only 502 are actually executed. Of these, onlv 
one branch accounts for 25'7c of the execution, and 
two branches for 50% of the execution. This table 
shows that most of the benchmarks exercise only a 
very small number of dynamic branches for the ma- 
jority of their execution. 

4.1 Performance of two-Jevel profiling 

Results for the two-level dirty-bil scheme are pre- 
sented in Table 5. The columns labeled T = y% 
are for a code explosion cutoff threshold of y%. The 
trace selection error is remarkably \ovc. even for a 
T = 0.1% cutoff. Several of the floating-point bench- 
marks achieve zero error. These benchmarks have a 
very low number of long-life dynamic branches, as 
shown in Table 4. The sources of the error are ex- 
plained by the distribution. of arc weight error, shown 
m Figure 6^ Notice that the majority of the differ- 
ence m arc weights occurs for blocks that comprise 
less than 1% of the execution. This is true for all 
the benchmarks. This shows thai the majority of 
the error for hardware profiling occurs for the iightlv- 
executed blocks. - * 

The code explosion cutoff threshold has a large ef- 
fect on the error. Lower cutoff thresholds produce 

^Only a subset of benchmarks art shown lo simplify ihe 
graph, but their behavior is typical. 



higher error for all schemes. This is because a hard- 
ware buffer only captures the most-frequently exe- 
cuted branches. Seldom-executed nodes and arcs will 
be poorly represented in the buffer. When the thresh- 
old IS 3%. to 5%, the error is zero in almost ail of the 
cases. Two exceptions are integer benchmark com- 
press and the floating-point benchmark tomcatv. For 
both these benchmarks, majority of the error is due 
to differences in two to three arc weights between the 
profiles. 



Table 5: Two level (dirty-bit) - trace selection 



error. 



Benchmark 


1 Trace selection 


error (percent) 


IT = 0.1% 


1 = = 3% 


T = 5% 


espresso 


1 16.46 


2.31 


0 


0 


xlisp 


12.23 


5.69 


0 


0 


eqntott 


3.95 


3.64 


3.64 


0 


compress 


16.16 


15.31 


6.12 


6.12 


sc 


15.37 


7.36 


0 


0 


gcc 


1 9.09 


0 


0 


0 


doduc 


2.19 


0 


0 


0 


nasa7 


0.21 


0 


0 


0 


mdljdp2 


3.80 


2.93 


0 


0 


wave 5 


5.60 


4.64 


0 


0 


tomcatv 


17.87 


17.87 


17.87 


17.87 


ora 


0.02 


0 


0 


0 
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Figure 5. Two-level {dtrly-bit): Distribution of arc 
weight error. 



Closer examination of the error for all benchmarks 
suggests that the errors in trace selection have the 
effect of reducing the scope of the profile- based opti- 
mizations. Specifically, in compress a trace composed 
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blocks 33-36-37-38 in t.lie acmal profile' \v-a.s splir 
o two traces between blocks 36 an. I 37 in the es- 
lated profile. This occurred because of an error in 
■ estimated arc frequency between blocks 36 and 
In this case the weight of tiiis arc was less than 
: trace selection threshold, preventing the trace to 
■w beyond beyond block 36. 

The primary reason that tomcati- experiences such 
h error is its lack of voluntary context switches 
system calls). Because of this, the buffer is 
nped only when the quantum expires. In the inte- 

benchmarks. system calls are relatively frequent, 
sing a higher sampling rate. The effect of sam- 
ig the buffer more frequently than once a context 
tch is examined below in Section ^.3. Increasing 

sampling rate reduces error without significant 
cution overhead". 

)le 6: Two level (marker-bit) - trace selection error. 



le-os c.Gooi 3.0C1 

Block, wv'.qnc tl09 scale) 



Figure 6: Two-level {marker-bit): Distribution of arc 
weight error. 



Benchmark 


Trace selection 


error (percent) 


T = 0,17c 


r = 17c T = 37ctT = 5% 


espresso 


12.67 


2.31 


0 


0 


xlisp 


7.75 


3.39 


0 


0 


eqntott 


3.95 


3.54 


3.64 


0 


compress 


16.16 


15.31 


6.12 


6.12 


sc 


8.46 


6.40 


0 


0 


gcc 


2.39 


0 


0 


0 


doduc 


1.21 


0 


0 


0 


nasa? 


0.21 


0 


0 


0 


mdljdp2 


3.80 


2.93 


0 


0 


waveS 


2.32 


2.32 


0 


0 


tomcat V 


17.87 


17.87 


17.87 


17.87 


ora 


0 


o{ 


0 


0 



aplementation of the marjter-6:i scheme decreases 
2 selection error over diriy-bit for the majority of 
benchmarks. For example, espresso drops from 
6% to 12.67%. The marker improves the accu- 

for lightly-executed basic blocks by increasing 
accuracy of arc weights. This can' be . seen by 
paring the marker-bii arc weight error disiribu- 

( Figure 6) to the distribution for the dirty-bti 
me (Figure 5). Observe that the error for lightly-, 
hted blocks has been significantly reduced, es- 
illy for the region between 10"^ and 0.001 (the 
ounced error crest in Figure 5). 
le marker-bit scheme does not help in all 
This is especially true when the error-causing 
ches executed fairly frequently. This is the case 
ompress, espresso and mdljdp2. 



he execution overhead is approximately 0.55%. 



4.2 Performance of one-level profiling 

Profiling using one-level branch prediction hardware 
is often less accurate than two-level because the hard- 
ware contains a less-sophisticated measure of branch 
history. This results in a higher error (see Table 7). 
The benchmarks that perform poorly for two-level, 
perform poorly here as well. The arc weight error 
distribution (Figure 7) demonstrates why this is true. 
Even though the overall shape of the graph resembles 
the other two. there is a higher concentration of arc 
weight error between block weights 0.1 (10%) and 0.3 
(30%). Because of this, some blocks that are rela- 
tively frequently accessed get incorrectly selected. 

It is interesting to note that in some cases the one- 
level profiling is more accurate than the two-level 
schemes. This can be seen for benchmarks such as 
sc, compress, or eqnioii. This is a consequence of es- 
timating the block weight accurately. The weights 
are estimated using the techniques outlined in Ta- 
ble 2 of the previous section. This method predicts a 
branch's execution frequency based on how many in- 
structions were executed since the last buffer dump. 
For the twcK level schemes, the analog of this count 
is the width of the history register. This count satu- 
rates at 12. when the entire history register contains 
valid entries. * 

4.3 The effects of sampling rate on error 

The results above show a relatively high error for 
benchmarks that sample the buffer only on quan- 
tum expiration. The worst case of this is the iomcaiv 
benchmark. The effect of making the sampling rate 
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Table 7: One-leve! - trace seleciion error (percent). 



Bench ina.rk 


Trace selection error 


T = 0.1% T = 17c T = 3% 


T = 5% 


espresso 


16.49 


5.51 


0.02 


0 


xlisp 


8.51 


2.30 


0 


0 


eqntott 


3.66 


3.64 


3.64 


0 


compress 


11.92 


10.59 


2.65 


2.65 


sc 


3.46 


1.14 


0 


0 


gcc 


5.99 


0 


0 


0 


doduc 


2.86 


0.14 


0 


0 


nasal 


0.42 


0.12 


0.06 


0 


mdljdp2 


2.80 


2.20 


0 


0 


wa,ve5 


7.03 


6.06 


0.07 


0.07 


tomcat V 


25.28 


25.19 


24.34 


24.34 


ora 


0.02 


0 


0 


0 



Table 8; The effecLs of sampling rate on error (tom- 
caiv). 



Sampling 


Two- level (marker- bit) 


rate 


T = 0.1% 


T = 17c 


T = 37c 


T 


= 57c 


4x 


17.87 


17.87 


17.87 




17.87 


8x 


17.87 


17.87 


17.87 




17.87 


16x 


0 


0 


0 




0 




Two-level {dirty-bit 






T = 0.1%, 


T = 1% 


T = 3% 


T 


= 57c 


4x 


17.87 


17.87 


17.87 




17.87 


8x 


17.87 


17.87 


17.87 




17.87 


16x 


0 


0 


0 




0 




One-level 






T = 0.1% 


T= }% 


T= 3% 


T 


= 5% 


4x 


24.53 


24.34 


24.34 




24.34 


8x 


19.85 


19.85 


17.90 




17.90 
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13.22 


13.04 


13.04 
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Figure 7: On e-level: Distribution of arc weight error. 



more frequent than the quantum are a reduction in 
error. Thts is presented in Table 8. 

The error drops to zero for the two-level schemes 
when the sampling rate is increased to dump the 
buffer 16 times more frequently than the normal con- 
text switching rate. This is an important result, since 
it indicates a fast sampling rate can reduce even the 
highest error. Although the one-level approach im- 
proves with sampling rate, it does not go to zero. In 
general, if the scope of optimizations is limited due 
to trace selection error, the hardware buffer' can be 
sampled more frequently. 

Increasing the sampling rate does not appreciably 
affect execution time. For example, on a Pentium- 
based ATA:T server the nominally hardware-profiled 
iomcaiv takes 54.2 sec to execute, whereas interrupt- 
ing tomcatr IG times more frequently takes 54.5 sec 



to execute^. The reason is that a buffer dump is not 
a full context switch. No change of context occurs. 
Sampling rate can be increased without significant 
performance impact, provided there is kernel support. 
The only performance degradation comes from flush- 
ing the buffer. 

5 Concluding Remarks 

This paper has presented a method for obtaining 
profile information without significant run- time slow- 
down (e.g., 0.47o-4.67o). This makes the compile- 
use-recompile approach presented here is much easier 
for software vendors than the traditional compile-rvn- 
recompile method of profiling. Using our techniques, 
software vendors can supply profiled versions of ap- 
plications to alpha- and beta-testers, later collecting 
the profiles for final profiled optimizations. No sam- 
ple suite of inputs is required. The longer the profiled 
version remains in the field, the higher the probability 
that the profiles match day-to-day use. Without the 
need for input sets, profiling can be used to optimize 
interactive, real-time, and system software packages. 

All of the features required to use our techniques 
are already present in commercial processors. Most 
of these processors have branch target buffers, many 
that employ two-bit counter predictors. The trace se- 
lection error for these schemes is quite small. When 
the error does occur, it has the effect of limiting the 
scope of the optimization, which has few detrimental 
effects. It is important to note that the experimental 



'Each of ihcsc experiments was run immediately after a 
reboot and (he results are reproducible. 
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results are for a single run of each benchmark. The 
profiles are likely to converge after multiple runs, re- 
ducing the error still further. 

Future superscalars will require branch prediction 
techniques more sophisticated than one-level BTB's. 
such as two-level approaches. Two schemes were pre- 
sented to add hardware profiling to two-level mech- 
anisms. Both schemes perform well for frequently- 
executed blocks. In addition, the marker-bii scheme 
performs well for moderately-executed blocks. The 
hardware overhead required for this scheme is mini- 
mal (specifically: one additional bit per HRT entry). 

In general, the techniques presented here signifi- 
cantly reduce the inconvenience of profiling. With 
contemporary microarchitectures driven increasingly 
by compiler support, hardware-based profiling is im- 
portant for continued improvements in processor per- 
formance. 
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Abstract 

Dynamic optimization of computer programs can dramat- 
ically improve their performance on a variety of applica- 
tions. This paper presents an efficient infrastructure for 
dynamic recompHation that can support a wide range of 
dynamic optimizations including profile -driven optimiza- 
tions. The infrastructure allows any section of code to 
be optimized and regenerated on-the-fty, even code for 
currently active procedures. The infrastructure incorpo- 
rates a low-overhead edge -count profiling strategy that 
supports first-class continuations and reinstrumentation of 
active procedures. Profiling instrumentation can be added 
and removed dynamically, arui the data can be displayed 
graphically in terms of the original source to provide use- 
ful feedback to the programmer. 

1 Introduction 

In the traditional model of program optimization and com- 
pilation, a program is optimized and compiled once« prior 
to execution. This allows the cost of program optimiza- 
tion and compilation to be amortized, possibly, ovei may 
program runs. On the other hand, it prevents the compiler 
from exploiting properties of the program, its input, and its 
execution environment that can be determined only at run 
time. Recent research has shown that dynamic compilation 
can dramatically improve the performance of a wide range 
of applications including network packet demultiplexing, 
sparse matrix computations, pattern matching, and mobile 
code [10. 8. 13. 16. 22, 23]. Dynamic optirnization works 
when a program is staged in such a way that the cost of 
dynamic recompilation can be amortized over many runs 
of the optimized code [21]. 

This paper presents an infrastructure for dynamic re- 
compilation of computer programs that permits optimiza- 

*This maiehal U based on work supported in pan by the Naiional Sci- 
ence Foundation under grant numbers CDA-9312614 and CCR-971I269. 
Robert C. Bui^ger was supported in part by a National Science Foundation 
Graduate Research Fellowship. 
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tion of code at run lime. The infrastructure allows any 
section of code to be optimized and regenerated on-the- 
fly. even the code for currently active procedures. In our 
Scheme-based implementation of the infrastructure, the 
garbage collector is used to find and relocate all references 
to recompiled procedures, including return addresses in the 
active portion of the control stack. 

Since many optimizations benefit from profiling infor- 
mation « we have included support for profiling as an inte- 
gral part of the recompilation infrastructure. Instrumen- 
tation for profiling can t>e inserted or removed dynami- 
cally, again by regenerating code on-the-fly. A variant of 
Ball and Larus's low- overhead edge-count profiling strat- 
egy [2], extended to support first-class continuations and 
reinstrumentation of active procedures, is used to obtain 
accurate execution counts for all basic blocks in instru- 
mented code. Although profiling overhead is fairly low, 
profiling is typically enabled only for a portion of a pro- 
gram run. allowing subsequent execution to benefit from 
optizations guided by the profiling information without the 
profiling overhead. 

A side benefit of the profiling support is that profile data 
can be made available to the programmer, even as a pro- 
gram is executing. In our implementation, the programmer 
can view profile data graphically in terms of the original 
source (possibly split over many files) to identify the "hot 
spots" in a program. The source is color-coded according 
to execution frequency, and the programmer can "zoom in" 
on portions of the program or portions of the frequency 
range. 

As a proof of concept, we have used the recompila- 
tion infrastructure and profiling information to support run- 
time reordering of basic blocks to reduce the number mis- 
predicted branches and instruction cache misses, using a 
variant of Pettis and Hansen's basic-block reordering algo- 
rithm [24]. 

The mechanisms described in this paper arc directly ap- 
plicable to garbage-collected languages such as Java, ML. 
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Scheme, and Smallialk in which all references lo a proce- 
dure may be found and relocated at run time. The mech- 
anisms can be adapted to languages like C and Fortran in 
which storage management is done manually by maintain- 
ing a level of indirection for all procedure entry points and 
return addresses. Although a level of indirection is com- 
mon for entry points lo support dynamic linking and shared 
libraries, the level of indirection for return addresses would 
be a cost incurred entirely to support run-time recompila- 
tion. 

Section 2 describes the edge-count profiling algorithm 
incorporated into the dynamic recompilaiion infrastruc- 
ture and briefly discusses how profile data is associated 
with source code and presented to the programmer Sec- 
tion 3 presents the infrastructure in detail and its proof- 
of-concepi application to basic-block reordering based on 
profile infonnation. Section 4 gives some performance 
data for the profiler and dynamic recompiler. Section 5 
describes related work. Section 6 summarizes our results 
and discusses future work. 

2 Edge-Count Profiling 

This section describes a low-overhead edge-count profit- 
ing strategy based on one described by Ball and Larus (2]. 
Like Ball and Larus's, it minimizes the total number of 
profile counter increments ai run time. Unlike Bait and 
Lanis's. it supports first-class continuations and reinstru- 
mentation of active procedures. Additionally, our strategy 
employs a fast log-linear algorithm to determine optimal 
counter placement. The recompiler uses the profiling in- 
formation to guide optimization, and it may also use the 
data to optimize counter placement, as described in Sec- 
tion 3. The programmer can view the data graphically in 
terms of the original source to identify the "hot spots" in a 
- program. 

Section 2. 1 summarizes Ball and Larus's optimal edge- 
count placement algorithm. Section 2.2 presenU modifica- 
tions lo suppon first-class continuations and reinstnimen- 
taiion of active procedures. Section 2.3 describes our im- 
plementation. Section 2.4 explains how the profile dau is 
correlated with the original source. 

2.1 Backgfx>und 

Figure I illustrates Ball and Larus's optimal edge-count 
placement algorithm using a Scheme procedure that re- 
moves all occurrences of a given item from a list. 

A procedure is represented by. a control-flow graph 
composed of basic blocks and weighted edges. The as- 
sembly language instructions for the procedure are split 
into basic blocks, which arc sequential sections of code for 
which conu-ol enters only at the top and exits only from the 
bottom. The branches at the bottoms of the basic blocks 
determine how the blocks are connected, so they become 



(define remq 
(lambda (jc Is) 
(cond 

[{null? Is) '()] 

{(eg? (car Is) x) (remq x {cdr Is))] 
[else (cons (car Is) (remq x (cdr Is)))]))) 
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Figure 1 : remq source, basic blocks, sample trace, and 
control-flow graph with thick, untnstrumented edges 
from orie of the twelve maximal spanning trees and en- 
circled pOMPts for the remaining. Instrumented edges. 
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(define fact 

(lambda {n done) 
(if«n2) 
{done I) 

(♦ n (faai- n ])done))))) 
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Figure 2: Source and basic blocks for fact, a variant 
of the factorial function which Invokes the done proce* 
dure to compute the value of the base case. 
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Figure 3: Trace and control-flow graphs Illustrating 
nonlocal exit from fact with and without an exit edge. 



the edges in the graph. The weight of each edge represents 
the number of times the corresponding branch is taken. 

The key property needed for optimal profiling is conser- 
vation of flow, i.e.. the sum of the flow coming into a basic 
block is equal lo the sum of the flow going out of it. In 
order for the control -flow graph to satisfy this property, it 
must represent all possible control paths. Consequently, a 
virtual "exit" block is added so that all exiu are explicitly 
represented as edges lo the exit block. Entry to the pro- 
cedure is explicitly represented by an edge from ihe exit 
block to the entry block, and the weight of this edge repre- 
sents the number of times the procedure is invoked. 

Because the augmented control-flow graph satisfies the 
conservation of flow property, it is- not necessary to instru- 
ment all the edges. Instead, the weights for many of the 
edges can be computed froni the weights of the other edges 
using addition and subtraction, provided that the un instru- 
mented edges do not form a cycle. The largest cycle-free 
set of edges is a spanning tree. Since the sum of the 
weights of the un instrumented edges represents the sav- 
ings in counting, a maximal spanning tree determines the 
inverse of the lowest-cost set of edges' to insuvment. ' 

The edge from the exit block lo the entry block docs 
not correspond to an actual instruction in the procedure, 
so it cannot be instrumented. Consequently, the maximal 
spanning tree algorithm is seeded with this edge, and the 
resulting spanning u-ce is still maximal [2]. 



2.2 Contr*>Mlow Aberrations 

The conservation of flow property is met only under the 
assumption that each procedure call returns exactly once. 
First-class continuations, however, cause some procedure 
activations to exit prematurely and others to be reinstated 
one or more limes. Even when there are no continuations, 
reinsmimenting a procedure while it has activations on the 
stack is problematic because some of its calls have not re- 
turned yet. 

Figure 2 gives a variant of the factorial function thai 
ilJusuaies the effects of nonlocal exit and re-entry using 
Scheme's call-with-cufreHi-continuation (call/cc) func- 
tion [II]. 

Ball and Larus handle the restricted class of exit-only 
continuations, e.g., seijmpllongjmp in C^ by adding to each 
call block an edge pointing to the e.^it block. The weight 
of an exit edge represents the number of times its asso- 
ciated call exits prematurely. Figure 3 demonstrates why 
exit edges are needed for exit-only continuations. Ball and 
Larus measure the weights of exit edges directly by mod- 
. ifying longjmp to account for aborted activations. We use 
a similar strategy, but to avoid the linear stack walk over- 
head, we measure the weights indirectly by ensuring that 
exit edges are on the spanning tree (sec Section 2.3). 

With fully general continuations, the weight of an exit 
edge represents the net numtwr of premature exits, and a 
negative weight indicates reinstatement. Figure 4 demon- 
strates the utility of exit edges for continuations that rein- 
state procedure activations. 
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Figure 4: Trace and control-flow graphs illustrating re- 
entry Into fact with and without an exit edge. 

2.3 Implementation 

To support profiling, ihc compiler organizes the generated 
code for each procedure into basic blocks linked by edges 
that represent the flow of conuol within the procedure. Ii 
adds the virtual exit block and corresponding edges and 
seeds the spanning tree with the edge from the exit block 
to the entry block as described in Section 2. 1 . Ii then com- 
putes the maxima] spanning tree and assigns counters to all 
edges not on the maximal spanning Uee. Finally, the com- 
piler inserts counter increments into the generated code, 
placing them into existing blocks whenever possible. 

For efficiency, our compiler uses the priority- first search 
algorithm for finding a maximal spanning tree [27]. Its 
worst-case behavior is 0((£ + B) log 5). where B is the 
number of blocks and E is the number of edges. Since 
each block has no more than two outgoing edges. E is 
0(B). Consequently, the priority- first algorithm performs 
very well with a worst-case behavior of 0(B log B). 

Another benefit of this algorithm is that it adds uhinstru- 
mented edges to the tree in precisely the reverse order for 
which iheir weights need to be computed using the conser- 
vation of flow property. As a result, count propagation does 
not require a separate dcpth-first search as described in [2). 
Instead, the maximal spanning tree algorithm generates the 
list used to propagate the counts quickly and easily. This 
list is especially important to the garbage collector, which 
must propagate the counts of recompiled procedures (sec 
Sections 3.1 and 3.3). 
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U A's only outgoing edge is c. 
pm the increment code in A. 

If B*s only incoming edge is c 
(and B is not the exit block), 
put the increment code in B. 



Otherwise, put the incremem 
code in a new block C that is 
spliced into the graph. 



Figure 5: Efficient instrumentation of edges. 

Figure 5 illusu-ales how our compiler minimizes the 
number of additional blocks needed to increment coun- 
ters by placing as many increments as possible in existing 
blocks. The increment instructions refer to the edge data 
structures by actual address. 

Instrumenting exit edges is more difficult because there 
are no branches in the procedure associated with them. We 
solved this problem for the edge from the exit block to 
the entry block by seeding the maximal spanning tree al- 
gorithm with this edge and proving that the resulting tree 
is still maximal. Unfortunately, exit edges rarely lie on a 
maximal spanning tree because their weights are usually 
zero. Consequently, there are two choices for measuring 
the weights of exit edges. 

First, we could modify continuation invocation to up- 
date the weights directly. Ball and Larus use this technique 
for exit-only continuations by incrementing the weights of 
the exit edges associated with each activation that will exit 
prematurely [2). This approach would support fully gen- 
eral continuations if it would also decrement the weights 
of the exit edges associated with each activation that would 
be reinstated. The pointer to the exit edge would have to be 
stored either in each activation record or in a static location 
associated with the ictum address. 

Second, wc could seed the maximal spanning tr«c algo- 
rithm with ail the exit edges. The resulting spanning u-ec 
might not be maximal, but it would be maximal among 
spanning trees that include all the exit edges. 

Our system's segmented stack implementation supports 
constant-time continuation invocation [17. 5]. Implement- 
ing the first approach would destroy this property. More- 
over, if any procedure is profiled, the system must traverse 
all activations to be sure it finds all profiled ones. Although 
the second approach increases profiling overhead, it affects 
only profiled procedures, the overhead is still reasonable, 
and the programmer may turn off accurate continuation 
profiling when it is not needed. Consequently, wc imple- 
mented only the second approach. 
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Figure 6: A section of code displayed using darker shades for more frequently executed code and lighter shades 
for less frequently executed code. 



2.4 Graphical Display of Profile Information 

Although the profile data is intended priinarily for use by 
the dynamic rccompiler, il is also useful for providing feed- 
back to the programmer. In order to provide useful feed- 
back, some way of associating profile data with source 
code is needed. The compiler attaches a source record con- 
taining a source file descriptor and a byte offset from the 
beginning of the file to each source expression as it is read. 
The compiler propagates these source records through {he 
intermediate compilation passes and associates each source 
record with an appropriate basic block. Within the repre- 
sentation of a basic block, each source record associated 
with the block is included within a special "source" pseudo 
instruction. These pseudo instructions do not resuU in the 
generation of any additional machine code. 

The default location for an expression's source record 
is in the basic block that begins its evaluation. For pro* 
ccdure calls, however, the source expression corresponds 
to the basic block that makes the call. In most situations, 
the count for this block is the same as the count for the 
block that begins to evaluate the entire call expression. A 
difference arises when continuations arc involved, and in 
this case we have found it more useful to know how many 
times the call is actually made. 

Each constant and variable reference occurs in just one 
basic block, so the source record goes there. When a con- 
stant or reference occurs in nontail position, its count can 
usually be determined from the count of the closest enclos- 
ing expression. An exception arises when the constant or 
reference occurs as the "then" or "else" part of an if ex- 
pression in nontail position. Consequently, the compiler 



generates source instructions for constants and references 
only when they occur in tail position or as the "then" or 
**else" pan of an if expression. By eliminating the source 
instructions for the remaining cases, the compiler signifi- 
cantly reduces the number of source records stored in basic 
blocks without sacrificing useful information. 

To display the block counts in terms of the original 
source, we follow three steps. First, we determine the 
count for each block by summing the weights of all its 
outgoing edges. Second, we build an association list 
of source expressions and block counts from the source 
records stored in each basic block. Third, we use this list 
to determine the source files that must be displayed by the 
graphical user interface and to guide a color-coding of the 
code according to the counts. The programmer can "zoom 
in" on portions of the program or portions of the frequency 
range and can also click on an expression to obtain its pre- 
cise count. Figure 6 gives an example. 

The programmer is not limited to displaying informa- 
tion for one procedure at a time. The data for all profiled 
procedures can be displayed at one time with a separate 
window for each source file; The data is sorted by fre- 
quency to help prograrnmers identify hot spots. This tech- 
nique proved useful in profiling the profiler itself, helping 
us identify inefficiencies in the maximal spanning tree and 
block look-up algorithms. 

3 Dynamic RecompUation 

This section presents the dynamic recompilation infras- 
tructure. Section 3.1 gives an overview of the recompila- 
tion process. Section 3.2 describes how the representation 
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of procedures is modified to accommodate dynamic re- 
compilation. Section 3.3 describes how the garbage collec- 
tor uses the modified representation to replace code objects 
with recompiled ones. Section 3.4 presents the recompila- 
tion process, which uses a variant of Penis and Hansen's 
basic-block reordering algorithm to reduce the number of 
mispredicted branches and instruction cache misses (24]. 

3.1 Overview 

Dynamic recompilation proceeds in three phases. First, the 
candidate procedures are identified, either by the user or 
by a program thai selects among all the procedures in the 
heap. Second, these procedures are recompiled and linked 
to separate, new procedures. Third, the original procedures 
arc replaced by ihe new ones during the next garbage col- 
lection. 

Because a procedure's entry and return points may 
change during recompilation, the recompiler creates a 
translation table that associates the entry- and return-point 
offsets of the original and the new procedure and attaches 
it to the original procedure. The collector uses this table 
to relocate call instructions and return addresses. Because 
the collector translates return addresses, procedures can be 
recompiled while they are executing. 

Before the next collection, only the original procedures 
are used. This invariant allows each original procedure to 
share its control- flow graph and associated profile counts, 
if profiling is enabled, with the new procedure. Because 
the new procedure's maximal spanning tree may be differ- 
ent from the original's, the new procedure may increment 
different counts. The collector accounts for the difference 
by propagating the counts of the original procedure so that 
the new procedure starts with a complete set of accurate 
counts. 

3.2 Representation 

Figure 7 illustrates our representation of procedures, high- 
lighting the minor changes needed to support dynamic re- 
compilation. A more detailed description of our object rep- 
resentation is given elsewhere [14]. 

Procedures are represented as closures and c,ode objects. . 
A closure is a variable- length array whose first element 
is the address of the procedure's anonynious eniiy point 
and whose remaining elements are the values of the proce- 
dure's free variables. The anonymous entry point is the 
first instruction of the procedure's machine code, which 
is stored in a code object: A code object has a six-word 
header before the machine code. The info field stores the 
code object's block structure and debugging information. 

The relocation table is used by the garbage collector to 
relocate items stored in the instruction stream. Each item 
has an entry that specifies the item's offset within the code 
stream (the code offset), the offset from the item's address 




to the address actually stored in the code stream (the item 
ofTsci). and how the item is encoded in the code stream (the 
type). The code pointer is used to relocate items stored as 
relative addresses. 

Because procedure entry points may change during re- 
compilation, relocation entries for calls use the long format 
so that the translated offset cannot exceed the field width. 
Consequently, the long format provides a convenient lo- 
cation for the "d" bit. which indicates whether an entry 
corresponds to a call instruction. Use of the long format 
does not isignificantly increase the table size because calls 
account for a minority of relocation entries. Because short 
entries cannot describe calls, they need not be checked. 

Three of the code object's type bits are used to encode 
the status of a code object. The first bit. set by the recom- 
piler, indicates whether the code object has been recom- 
piled to a new one and is thus obsolete. Since an obsolete 
code object will not be copied during collection, its relo- 
cation table is no longer needed. Therefore, its rcloc field 
is used to point to the translation table instead. The list 
of original/new offset pairs is sorted by original offset to 
enable fast binary searching during relocation. 

The second bit, denoted by **n" in the figure, indicates 
whether the code object is new. This bit. set by the recom- 
piler and cleared by the collector, is used to prevent a new 
code object from being recompiled before the associated 
obsolete one has been eliminated. The third bit, denoted 
by "b" in the figure, serves a similar purpose. It indicates 
whether an old code object is busy in that it is either being 
or has been recompiled. This bit is used to prevent multi- 
ple recompilation of the same code object. The recompiler 
sets this bit at the beginning of recompilation. Because the 
recompiler may trigger a garbage collection before it cre- 
ates a new code object and marks the original one obsolete, 
this bit must be preserved by the collector. It is possible to 
recompile a code object multiple times; however, each sub- 
sequent recompilation must occur after the code object has 
been recompiled and collected. (It would be possible to 

-relax this restriction.) 

In order to support multiple return values, first-class 
continuations, and garbage collection efficiently, four 

■ words of dau are: placed sn the instruction stream imme- 
diately before the single- value return point from each non- 
tail call (17. 1,5]. The live mask is a bit vector describing 
which frame locations contain live data. The code pointer 
is used to find the code object associated with a given re- 
turn address. The frame size is used during garbage collec- 
tion, continuation invocation, and debugging to walk down 
a stack one frame at a time. 

33 Collection 

Only a few modifications are necessary for the garbage 
collector to support dynamic recompilation. The primary 



iOOCIO: <XP Be3603A_l_> 



245 



unused |0|n|b|^;biyp,|tOI01IIO 



closure: 



stack: 




short 
reloc entry: 



long 
icloc enuy: 



size 


code 


rcloc 
entry 1 


a 


rcloc 1 














0|xx| 


S-bit 1 12-bit 
type 1 code offset 


1 2-bit 
item offset 
















I|xx| 


5-bit 1 
type 1 


unused 




unused j d 




item offset 




code ofTssi 





unused |l|Q|l|^^ffp^|lOIOino 



obsolete 
code: 



1 »ypc 


length 


trans 
table 


name 


free 
count 


info 


inst 


a 


inst 


















4 size 


new 
code 


ong 
offset) 


new 
offset 1 




ong 
offsetf 


new 
ofTseir 





Figure 7: Representation of closures, code objects, and stacks with the infrastructure for dynamic recompliotion 
highiighted. 



change involves how a code object is copied. If the obso- 
ieie bit is ciear, the code object and associated relocation 
table are copied as usual, but the new bit of the copied code 
object's type field is cleared if set. 

If the obsolete bit is set, the code object is not copied at 
ail. Instead, the pointer to the new code object is relocated 
and stored as the forwarding address for the obsolete code 
object so that all other pointers to the obsolete code object 
will be forwarded to the new one. Moreover, if the obso- 
lete code object is instrumented for profiling, ihc^ collector 
propagates the counts so that they remain accurate when 
the new code object increments a possibly difTcrcsJt subset 
of the counts. The list of edge/block pasrs used for prop- 
agation is found in the info structure. Since the propaga- 
tion occurs during collection, some of the objects in the list 
may be forwarded, so the collector must check for pointers 
to forwarded objects as it propagates the counts. 

The translation table is used to reloc^le call instructions 
and return addresses. Call instructions are foimd only in 
code objects, so they are handled v/hsn code objects are 
swept. The "d" bit of long-format relocation cnuies iden- 
tifies the candidates for translation. Return addresses arc 
found only in stacks, so they are handled when continua- 
tions are swept. 



Since our collector is generational, we must address the 
problem of potential cross-generational pointers from ob- 
solete to new code objects. Our segmented heap model al- 
lows us to allocate new objects in older generations when 
necessary [14]; thus, we always allocate new code objects 
in the same generation as the corresponding obsolete code 
objects. This allows us to keep the code space free of cross- 
generational pointers from older to newer objects. 

3.4 Block Reorderisag 

To demonsuate the dynamic recompilation infrastructure, 
we have applied it to the problem of basic-block reorder- 
ing. We use a variant of Pettis and Hansen's basic-block 
reordering algorithm to reduce the number of mispredicted 
branches and insuvction cache misses [24], We also use 
tlie edge-count profile data to decrease profiting overhead 
by re-running the maximal spanning tree algorithm to im- 
prove counter placement. 

The block reordering algorithm proceeds in two steps. 
First, blocks are combined into chains according to the 
most frequently executed edges to reduce the number of 
instruction cache misses. Second, the chains are ordered to 
reduce the number of mispredicted branches. 

Initially, every block comprises a chain of one block, it- 
self. Using a list of edges sorted by decreasing weight, dis- 
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Figure 8: Adding branch prediction preferences for architectures that predict backward conditional branches taken 
and forward conditional branches not taken. 



tinct chains A and B are combined when an edge's source 
block is the tail of A and its sink is the head of B. When 
all the edges have been processed, a set of chains is left. 

The algorithm places ordering preferences on the chains 
based on the conditional branches emitted from the blocks 
within the chains and the target architecture's branch pre- 
diction strategy. Blocks with two outgoing edges always 
generate a conditional branch for one of the edges, and ihey 
generate an unconditional branch when the other edge does 
not point to the next block in the chain. Figure 8 illusu-ates 
how the various conditional branch possibilities generate 
preferences for a common prediction strategy. The pref- 
erences are implemented using a weighted directed graph 
with nodes representing chains and edges representing the 
"should come before" relation. 

As each block with two outgoing edges is processed, its 
preferences (if any) are added to the weights of the graph. 
Suppose there is an edge of weight x from chain A lo B 
and an edge of weight y from chain B to ^4. and x > y. 
The second edge is removed, and the first edge's weight 
becomes x - y,so that there is only one positive- weighted 
edge between any two nodes. A depth- first search then 
topblogically sorts the chains, omitting edges that cause 
cycles. The machine code for the chains is placed in a new 
code object, and the old code object is marked obsolete and 
has its reloc field changed to point to the translation table. 

4 Performance 

To assess both compile-lime and run-time performance 
of the dynamic recompiler and profiler, we used a set of 
benchmarks comprised of Chez Scheme S.Og recompiling 
itself. Soft Scheme [29] checking its pattern matcher. Dig- 
ital Design Derivation System 1.0 [4] deriving a Scheme 
CPU [6], Similix 5.0 partially evaluating itself (3], and the 
Gambii-C 2.3. J benchmark suite. All measurements were 
taken on a DEC Alpha 3000/600 running Digital UhOX 
V4.0A and are reported in detail elsewhere [7]. 



Initial instrumentation has an average run-time over- 
head of 50% and an average compile -time overhead of 
1 1 %. Without block reordering, optimal count placement 
reduces the average run-time overhead to 37%, and the 
average recompile time is only 15% of the base compile 
lime. With block reordering, the average run -time over- 
head drops to 27%. while the average recompile time in- 
creases very slightly to 16%, 

The reduction of profiling overhead from 50% to 37% 
is quite respectable and shows that optimal counter place- 
ment can be an effective tool for use in long program 
runs intended to gather precise overall profiling infor- 
mation. Moreover, profile-driven run-time optimizations 
such as block reordering further reduce profiling overhead. 
Nonetheless, optimized code will run faster with profiling 
disabled. This suggests a su-ategy for dynamic optimiza- 
tion in which a program is dynamically optimized and pro- 
filing instrumentation removed after an initial portion of 
a program run during which profiling information is gath- 
ered. In addition, profiling can be enabled for particular 
procedures at various times during a run to provide a sort 
of statistical sampling, although this would likely have to 
be done at the programmer's direction. 

To assess the effectiveness of the block -reordering al- 
gorithm, we measured the number of mispredicted condi- 
tional branches and the nijinber of unconditional branches. 
The Alpha architecture encourages hardware and compiler 
implementors lo predict backward conditional branches 
taken and fcr-vard condiiional branches not taken [28]. 
Current Alpha implementations use this static model as 
a starting point for dynamic branch prediction. We com- 
puted the mispredicted branch percenuge using the static 
model as a metric for determinirig the effectiveness of the 
block-reordering algorithm. Like Pettis and Hansen's algo- 
rithm, ours achieves near- optimal misprediction rates, sig- 
nificantly reduces the number of unconditional branches, 
and provides a modest 6% reduction in run time. Our al- 
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gorithm is also fast, for the average recompile time is just 
12% of the base compile time. Because the Alpha per- 
forms dynamic branch prediction, the 6% reduction in run 
time is likely due mostly to a reduction in instruction cache 
misses. 

5 Related Work 

Our edge-count profiling algorithm is based on one de- 
scribed by Ball and Larus [2], who applied their algo- 
rithm in a traditional static compilation environment. We 
have extended their algorithm to support first-class con- 
tinuations and reinstrumentation of active procedures. We 
have also identified an algorithm for determining optimal 
counter placement that is faster and eliminates the need for 
a separate depth-first search for count propagation. 

We have used Pettis and Hansen's intraproccdural 
block-reordering strategy with only minor modifications to 
apply it in the context of dynamic recompilation. Sam- 
ples [26] explores a similar intraprocedural algorithm that 
reduces insU^ction cache miss rates by up to 50%. Pettis 
and Hansen [24] also describe an interprocedural algorithm 
that places procedures in memory such that those that ex- 
ecute close together in lime will also be close together in 
memory. Since determining the optimal ordering is NP- 
compleie, they use a greedy "closest is best" suategy. 

Several recent research projects have focused on light- 
weight run-time code generation [8, 15. 22, 20, 25], with 
code generation costs on the order of five to 100 insu^uc- 
tions executed for each instruction generated. Although 
the overhead inherent in our model is greater, the potential 
benefits are greater as well. 

Little attention has previously been paid to profile- 
driven dynamic optimizations, with the notable exception 
of work on Self [9, 1 8]. This work uses a specialized form 
of profiling to guide generation of special-pui-posc code to 
avoid generic dispatch overhead. Our infrastructure incor- 
porates a more general profiling strategy that is not targeted 
to any particular optimization technique, although we have 
so far applied it only to the limited problem of block re- 
ordering. 

Keppel has investigated the application of nin-time code 
generation to value-specific optimizations, in which code is 
special-cased to particular input values [19]. Although we 
have so far focused on profile-driven optimizations; we be- 
lieve that our infrastructure is' well suited to value-specific 
optimizations as well. 

6 Conclusions 

We have described an efficient infrastructure for dynamic 
recompilation that incorporates a low-overhead edge-count 
profiling strategy supporting first-class continuations and 
reinstrumentation of active procedures. We have shown 



how the profile data can be used to improve performance 
by reordering basic blocks and optimizing counter place- 
ment. In addition, we have explained how the profile data 
can be associated with the original source to provide graph- 
ical feedback to the programmer. 

The mechanisms described in this paper have all been 
implemented and incorporated into Chez Scheme. The re- 
compiler can profile and regenerate all code in the system, 
including itself, as it runs. Performance results show that 
the average run -time profiling overhead is initially 50% 
and decreases significantly with recompilation. The av- 
erage compile-time profiling overhead is 11%, and the av- 
erage recompile time relative to the base compile time is 
12% without profiling instrumentation and 15-16% with 
profiling instrumentation. The block -reordering algorithm 
is fast and effective at reducing the number of mispredicted 
branches and unconditional branches. 

The techniques and algorithms are directly applicable 
to garbage-collected languages such as Java, ML, Scheme, 
and Smalltalk, and can be adapted to other languages as 
described iri Section 1 . 

Dynamic recompilation need not be limited to low-level 
optimizations such as block reordering. A promising area 
of future work involves associating the profile data with 
earlier passes of the compiler so that higher-level opttmiza- 
iions can take advantage of the information. For example, 
register allocation, flow analysis, and inlining could ben- 
efit from profile data. An unoptimized active code seg- 
ment with no analogue in the optimized version of a pro- 
gram, such as code for an active procedure that has been 
in lined at its call sites, can be retained by the system un- 
til no longer needed. This will, in fact, occur naturally in 
garbage-collected systems. 

Another area of future work involves a generalization 
of edge-count profiling to measure other dynamic program 
characteristics such as the number of procedure calls, vari- 
able references, and so forth. For example, profile data can 
be used to measure how many times a procedure is called 
versus how many times it is created, and this ratio could be 
used to guide lambda lifting ( 1 2}. Combined with an esti- 
mate of the cost of gienerating code for the procedure, this 
ratio could also help determine when run-time code gen- 
eration [22] would be profitable. Our system can also be 
extended to recompile (specialize) a closure based on the 
run-time values of its free variables. 
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